Test → Applovin Scale Analysis v3.2

Methodology & Findings · Data Science Report · Andres Mendoza — Growth Data Analyst (Central Hybrid Team)

Question: Do Facebook auto-test metrics predict Applovin scale?
Test data: 2025-10-18 → 2026-04-28 (24 windows)
Scale data: Feb–May 2026 partial ($9.57M ALL_ recovered)
Generated: 2026-05-12

Audience: data analysts, data scientists, UAM team. This is the canonical methodology + findings report for the test → scale validation project. Every claim is anchored to a CSV in analysis_outputs/; every number is reproducible from RELIABLE_cohort_with_install_share.csv.

📌 Important data caveat: The Applovin spend figures in this report reflect only the creative-level granularity we were able to recover through monthly exports (Feb–May 2026). Earlier months and other granularity levels were not accessible due to dashboard export limitations (50,000-row cap on daily exports). The 453 "scale-only" creatives, including R1426_V1 ($480K) and R1608_V1 ($391K), were tested in Facebook before our 6-month test window started (Oct 2025). Their absence from the test side is a data-recovery limitation, not a methodology flaw.

Contents

Executive Summary
Business Problem & Why This Matters
Operational Context — What Shapes the Cohort
Methodological Approach
Data Sources & Cleaning
Outcome Distribution & Why It Shaped the Method
Tests Conducted & Why Each One
Statistical Results
The Lumina Cross-Network Scale Score
Findings Ranked by Reliability
Known Limitations & Caveats
Actionable Recommendations
Appendix: File & Reproduction Index

1 · Executive Summary

Reliable cohort (n): 669
Best CV AUC achieved: 0.77
Features in winning composite: 5
Gini of spend distribution: 0.91

Headline finding: Lumina D7 has real predictive power for Applovin scaling — and it becomes visible once we control for test-phase install volume. The raw correlation with scale is near zero (ρ = +0.02), but after controlling for installs the partial correlation rises to ρ = +0.12 (p = 0.001) — the strongest clean signal of any feature. A 5-feature composite (Lumina D0 + Lumina D7 + log(IPM) + log(ROAS D7) + install_share_in_window) reaches cross-validated AUC = 0.77 on the reliable cohort (n=669, 5-fold CV).

The install-volume effect (Section 8.4): Under our auto-test design, every creative gets the same daily budget (~$15–$20/day for ~4 days), so all creatives compete with equal spend. Within that fixed envelope, the Facebook algorithm decides how to allocate impressions — and a more-efficient creative (low CPM, high CTR×CVR) mechanically gets more installs from the same dollars. This means install volume in our test data is essentially measuring conversion efficiency at constant spend — the same economic quality that Lumina D7's IPM, CPI, and ROAS_D7 components also capture. As a result, Lumina D7 itself is correlated with installs (ρ = +0.60), and the absolute-volume information partly masks Lumina's cohort-relative ranking signal. Two recommended composites are offered in Section 9: a high-AUC (0.77) version that leverages all signals, and a clean-causal (0.65–0.70) version that controls for installs explicitly so the coefficients have unambiguous interpretations.

Five sentences for stakeholders

We analyzed 669 creatives that ran on both the Facebook auto-test pipeline (with mature D7 Lumina scores) and Applovin (with real impressions, clicks and spend across 4 months).
Applovin spend is extremely Pareto-distributed — top 1% of creatives captures 74% of recovered spend — so we used a top-decile binary classification (≥$2,820 spend) as the cleanest outcome.
Lumina D7 does carry predictive signal for Applovin scaling, but it is partially hidden by a sample-size confound: Facebook auto-test allocates more impressions to creatives that perform well early, so IPM/ROAS/installs all move together. After controlling for install volume, Lumina D7's partial correlation with scale rises from ρ = +0.02 to ρ = +0.12 (p = 0.001).
A 5-feature composite reaches AUC 0.77 — combining per-window-normalized Lumina with raw-level economic metrics and audience reach. A cleaner-causal alternative using only Lumina D0 + Lumina D7 + install_share + log(installs) reaches AUC 0.70 with unambiguous coefficient signs.
The recommendation is to publish the Lumina Cross-Network Scale Score alongside Lumina in the dashboard, not to modify Lumina itself; Lumina remains the right tool for within-cohort Facebook ranking.

2 · Business Problem & Why This Matters

What we wanted to answer

The Lumina Score is the team's north-star metric for ranking creatives in Facebook auto-test campaigns. The score is well-defined for its original purpose: which creative outperformed its window-mates in the same Facebook test cohort?

But the operationally meaningful question for the UAM team is different: "if a creative scored well on the Facebook test, does it actually scale on Applovin?" Applovin is where most of the production spend happens — Lumina's value rises if it predicts Applovin scaling, and falls if it doesn't.

Why this matters financially (with recovered-data caveat)

What's in this chart: These are the creative-level monthly spend totals we were able to recover from Applovin via the monthly export (Feb 2026 – May 10 2026, partial month). Earlier months (and finer time slicing) were not accessible at the creative level. Total recovered: $9.57M ALL_ spend across 1,227 unique ALL_ creatives. The April spike to $5.24M is real, but is also partially a function of when we started having visibility — older spend on long-running creatives is folded into "lifetime" cumulative numbers from the aggregated CSV, not visible in this monthly cut.

ALL_ creative spend on Applovin grew sharply from $950K (Feb) to $5.24M (Apr) in our recovered window, driven mostly by a few breakout creatives. The UAM team selects creatives for Applovin partly based on Facebook test signals. If that signal is weak, the team is essentially picking by intuition; if it's strong, the pipeline is justified.

Why this is harder than a typical correlation question

Selection bias (see Section 3 — confirmed with data, not just hypothesized).
Extreme outcome distribution: Pareto-on-steroids (Gini 0.91). 6 creatives carry 74% of reliable-cohort spend.
Tiny upper tail: only 17 creatives reach $10K+ in recovered Applovin spend. Insufficient sample size for top-tier modeling.
Historical blind spot: 453 high-scale Applovin creatives (incl. R1426_V1 $480K, R1608_V1 $391K) were never in the auto-test pipeline — they're pre-October-2025 historical winners, confirmed.
Cross-platform translation: Lumina is calibrated for Facebook; Applovin's algorithm rewards different signals than Facebook's.

3 · Operational Context — What Shapes the Cohort We're Modeling

Before interpreting predictive results, we examine which test-phase signals are systematically different between the creatives that reached Applovin and those that did not. This is not about evaluating any team's process — it is about characterizing the population our model is trained on, so that downstream conclusions (especially about Lumina) are read in context.

📌 Important data-availability constraint: Applovin spend data is only available from February 2026 onward. The Facebook test windows span October 2025 – April 2026. This creates an asymmetric observation window:

Test windows 1–16 (Oct 2025 – Feb 2026): we cannot know whether a creative was activated at the time, only whether it is still active 1–5 months later. This is a survivorship measure, not a propagation rate.
Test windows 17–24 (Feb 27 – Apr 28, 2026): Applovin observation is concurrent with test end. Here "active on Applovin" is a fair proxy for "propagated to Applovin."

The temporal analysis below is restricted to the concurrent windows (17–24).

3.1 — Comparing test-phase metrics: propagated vs not-propagated cohorts

This snapshot comparison is valid for the full cohort (it does not depend on timing). We display the percentage difference between medians on a unified scale to make small but real operational differences visible — raw values on different units obscure the pattern.

Test-phase metric	Propagated (median)	Not propagated (median)	Δ %	Mann-Whitney p	Reading
Test installs (volume)	74	63	+17%	0.019 ✓	Larger-audience creatives advance
CPI ($)	$0.39	$0.45	−14%	0.020 ✓	Cheaper-acquisition creatives advance
CTR (raw)	0.881	0.848	+4%	0.007 ✓	Higher-engagement creatives advance
IPM	1.619	1.539	+5%	0.572	Directional only
ROAS D7 (raw)	0.015	0.015	−1%	0.415	No difference
ROAS D0 (raw)	0.003	0.003	−1%	0.587	No difference
Lumina D0 percentile-in-window	50.8	48.5	+5%	0.999	No difference (see note)
Lumina D7 percentile-in-window	49.4	52.4	−6%	0.402	No difference (see note)

Why a fairer Lumina comparison was needed: Lumina scores are per-window normalized to 0–100. The median Lumina score in any window is always near 50 by construction — so comparing raw Lumina medians (42.6 vs 42.9) cannot show real differences. Instead we use the percentile rank of each creative within its own test window, which preserves the relative position information. Even with that fairer view, the medians remain near 50 with p > 0.40 for both D0 and D7 — propagation is not stratifying creatives by their within-window Lumina ranking in either direction.
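
The percentile-rank framing described above can be sketched as follows. This is a minimal illustration, not the report's pipeline: the `(token, window, score)` row layout, the mid-rank handling of ties, and the toy rows are all assumptions.

```python
from collections import defaultdict

def percentile_in_window(rows):
    """rows: iterable of (token, window, score).
    Returns {(token, window): percentile rank on a 0-100 scale}."""
    by_window = defaultdict(list)
    for token, window, score in rows:
        by_window[window].append((token, score))

    out = {}
    for window, members in by_window.items():
        scores = [s for _, s in members]
        n = len(scores)
        for token, score in members:
            below = sum(1 for s in scores if s < score)
            equal = sum(1 for s in scores if s == score)
            # Mid-rank convention: tied creatives share the average position.
            out[(token, window)] = 100.0 * (below + 0.5 * equal) / n
    return out

# Made-up rows: two windows of different size.
rows = [
    ("A", 1, 10.0), ("B", 1, 40.0), ("C", 1, 90.0), ("D", 1, 100.0),
    ("E", 2, 55.0), ("F", 2, 60.0),
]
pct = percentile_in_window(rows)
for key in sorted(pct):
    print(key, pct[key])
```

Because the rank is computed inside each window, a creative's percentile depends only on its window-mates, which is the property the comparison above relies on.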

How to read this section in plain terms:

Three concrete operational metrics (test installs, CPI, and CTR) are statistically different between the propagated and not-propagated cohorts at p ≤ 0.02. Creatives that reached Applovin tended to have more test installs, lower acquisition cost, and slightly higher click-through.
The Lumina composite scores themselves do not show a statistically distinguishable difference between the two cohorts (p = 0.40–0.99) even after the fairer percentile-rank framing.
This is operational context, not a judgment: the natural triggers for advancing a creative ("it had volume, it was cheap to acquire users on, people clicked") are visible in the data; the relative composite ranking is not the dominant trigger.
Implication for the modelling that follows: because the cohort we observe was shaped partly by these operational signals (not strictly by Lumina rank), univariate AUC for Lumina alone underestimates its potential value on a less-filtered population. The multivariate composite in Section 9 partially compensates by including the same operational signals (IPM, ROAS, install share) that the propagation process is already implicitly using.

3.2 — Propagation rate by Lumina quartile, concurrent windows only

Restricted to windows 17–24 (Feb–Apr 2026), where Applovin observation overlaps with test end. n=253. This is the data-honest slice for a direct propagation-rate measurement.

Lumina D0 quartile (within window)	n	Propagated	Propagation rate
Q1 (low)	59	43	72.9%
Q2	57	45	78.9%
Q3	56	42	75.0%
Q4 (high)	53	41	77.4%
Chi-square test (D0 quartile × propagated)	chi² = 0.67, p = 0.880 — quartile and propagation status are independent

Concurrent-window finding: Propagation rate is essentially flat across Lumina D0 quartiles (72.9% → 78.9% → 75.0% → 77.4%, chi² p = 0.88). Within this honest temporal slice, Lumina D0 quartile does not statistically stratify whether a creative reached Applovin. Same result for D7 (chi² p = 0.57). This is consistent with Section 3.1: the operational signals (CPI, CTR, installs) — not the Lumina composite — are the variables that differentiate the propagated cohort from the not-propagated cohort.
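
As a reproducibility aid, the chi-square test on the D0 quartile table can be recomputed from the four (n, propagated) pairs alone. The sketch below uses only the standard library; the closed-form tail probability is specific to 3 degrees of freedom, and the variable names are illustrative. With these counts it should land close to the reported chi² = 0.67 and p = 0.88.

```python
import math

# Counts from the quartile table above: (n, propagated) per Lumina D0 quartile.
quartiles = {"Q1": (59, 43), "Q2": (57, 45), "Q3": (56, 42), "Q4": (53, 41)}

# 4 x 2 contingency table: [propagated, not propagated] per quartile.
observed = [[prop, n - prop] for n, prop in quartiles.values()]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (4 - 1) * (2 - 1) = 3

def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_3dof(chi2)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```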

3.3 — Inventory: top-quartile-Lumina creatives that were not activated on Applovin

A direct artifact of the propagation pattern: which creatives ranked in the top quartile of their test window (by Lumina D0 or D7) but were not activated on Applovin? Below is the per-window count — useful as a candidate-audit list for retrospective review.

Top-Q4 D0 creatives never on Applovin: 40
Total Top-Q4 D0 creatives in cohort: 191
% of top-Q4 D0 that never reached Applovin: 21%
Top-Q4 D7 creatives never on Applovin: listed in Q4_D7_not_picked_inventory.csv

40 of 191 top-Q4-D0 creatives (21%) were never activated on Applovin. Five of them scored a perfect Lumina D0 = 100 (top of their window). The full inventory is in Q4_D0_not_picked_inventory.csv and Q4_D7_not_picked_inventory.csv. Top 15 examples:

Window	Token	Lumina D0	Lumina D7	Installs
5	`C423_V10_WW`	100.00	92.31	110
7	`C444_V4_WW`	100.00	48.48	93
9	`C450_V6_WW`	100.00	100.00	79
11	`C462_V7_WW`	100.00	79.65	128
16	`C491_V8_WW`	100.00	88.63	78
18	`E41_V1_WW`	99.32	46.96	91
21	`C523_V3_WW`	96.80	98.78	53
16	`C491_V5_WW`	90.54	91.06	132
16	`C486_V2_WW`	81.29	68.15	63
19	`C511_V9_WW`	80.05	100.00	116
5	`C423_V11_WW`	79.85	84.55	85
7	`C444_V6_WW`	79.53	53.00	123
19	`C511_V8_WW`	77.36	77.97	48
5	`R1617_V6_WW`	73.35	80.87	76
5	`C422_V7_WW`	72.92	84.44	101

Caveat on the inventory: Some of these creatives from early windows (1–16) may have been activated at the time and killed before Feb 2026 — we cannot rule that out due to the Applovin observation window. For late-window examples (17–24), there is no early-Applovin censoring possibility, so those are genuinely "ranked top-quartile, never reached Applovin" cases. The full inventory is provided so that retrospective review is possible.

3.4 — Implications for the modelling that follows

Because propagation is partly orthogonal to Lumina rank, the cohort of "activated + scaled" creatives we use for modelling is shaped by operational filters (CPI, CTR, installs). This has three consequences for how the results in subsequent sections should be read:

Univariate AUC for Lumina-only models is diluted by the operational filtering in the propagation step — Lumina's true predictive power on a less-filtered population is likely higher than the 0.50 we observe here.
The composite model in Section 9 partially compensates because it includes the same operational signals (IPM, ROAS, install_share) that propagation is implicitly using — giving the score access to both the rank-based view (Lumina) and the absolute-level view.
The 40 "top-quartile-Lumina, not propagated" creatives in Section 3.3 are candidate audit cases, not necessarily mis-decisions. Some may have had operational reasons not visible in the test data; others may genuinely represent untapped scale potential. The inventory exists so that retrospective review can answer this.

4 · Methodological Approach

Test-platform context for readers: The Facebook auto-test we use as our quality signal is a fixed-budget design — every creative in a given test window is allocated the same daily budget (~$15–$20/day for ~4 days, so ~$60–$80 per creative total). The Facebook algorithm decides how to spend that budget for each creative (which audiences, at what CPM), but the dollar envelope is equal. This matters for interpreting our findings: variance in install counts across creatives in a window is variance in conversion efficiency at constant spend, not variance in budget allocation. This nuance becomes important in Section 8.4.

The full analysis ran through 13 sequential approaches, each designed to address a specific weakness of the previous one. The methodology is summarized below; each step's rationale is explained in Section 7.

Step 1 — Build a one-row-per-creative master table

Aggregate Facebook test metrics per creative token (handling benchmarks specially since they appear in multiple windows). Join to Applovin scale via parsed token (e.g., ALL_C396_V4_WW_VID_... → C396_V4_WW).
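
The token join can be illustrated with a short sketch. It uses the prefix of the cleaning regex listed in Section 5; the part of the creative name after `_VID` is elided in this report, so the suffix in the example is a made-up placeholder, and `parse_token` is a hypothetical helper name.

```python
import re

# Prefix of the regex from the cleaning table; the remainder of the
# creative name is not specified in the report and is ignored here.
TOKEN_RE = re.compile(r"ALL_([A-Z]+\d+_V\d+(-\w+)?_[A-Z]+)_VID")

def parse_token(creative_name):
    """Return the test-side token, or None if the name does not fit the pattern."""
    match = TOKEN_RE.match(creative_name)
    return match.group(1) if match else None

# "_placeholder" stands in for the elided suffix.
print(parse_token("ALL_C396_V4_WW_VID_placeholder"))
print(parse_token("SOME_OTHER_NAME"))
```

Returning `None` for non-matching names keeps unmatched Applovin rows visible for auditing instead of silently dropping them.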

Step 2 — Filter to genuinely tested cohort

A creative is only "tested on Applovin" if it received impressions AND clicks AND spend. This check surfaced 27 data anomalies (impressions-no-spend, spend-no-clicks). Applying the strict activity criteria reduces the set from 1,425 to 1,165 active creatives.
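
A minimal sketch of the strict activity flag and the two anomaly patterns named above. The dictionary keys and the sample rows are assumptions made for illustration; the real export columns may be named differently.

```python
def is_strictly_active(row):
    """True only when impressions, clicks, and spend are all positive."""
    return row["impressions"] > 0 and row["clicks"] > 0 and row["spend"] > 0

def anomaly_type(row):
    """Name the anomaly pattern for a row, or None if neither pattern applies."""
    if row["impressions"] > 0 and row["spend"] == 0:
        return "impressions-no-spend"
    if row["spend"] > 0 and row["clicks"] == 0:
        return "spend-no-clicks"
    return None

# Made-up rows for illustration only.
rows = [
    {"token": "X1", "impressions": 1200, "clicks": 30, "spend": 45.0},
    {"token": "X2", "impressions": 800, "clicks": 12, "spend": 0.0},
    {"token": "X3", "impressions": 500, "clicks": 0, "spend": 9.5},
    {"token": "X4", "impressions": 0, "clicks": 0, "spend": 0.0},
]

active = [r["token"] for r in rows if is_strictly_active(r)]
anomalies = {r["token"]: anomaly_type(r) for r in rows if anomaly_type(r)}
print(active, anomalies)
```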

Step 3 — Use mature-only Lumina scores

Lumina D7 is only defined when test data is mature (≥7 days post window end + ≥10 installs at D7). Earlier passes averaged immature scores too, weakening signal.

Step 4 — Choose outcomes by class balance, not arbitrary thresholds

Given Pareto distribution: top-decile binary (≥$2,820 spend, 10% base rate) is the cleanest target. Mega-scale (≥$10K, 2.5% base rate) is too rare for reliable single-feature modeling.
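
The top-decile outcome can be sketched as a quantile cut followed by a binary label. The linear-interpolation quantile below is one common convention and is an assumption (the report does not state which one produced the $2,820 threshold); the spend values are made up.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def top_decile_labels(spend):
    """Return the p90 threshold and a 0/1 label per creative (1 = at or above p90)."""
    threshold = quantile(spend, 0.90)
    return threshold, [1 if s >= threshold else 0 for s in spend]

# Made-up, heavily skewed spend values.
spend = [12, 40, 95, 180, 350, 502, 900, 1600, 2900, 120000]
threshold, labels = top_decile_labels(spend)
base_rate = sum(labels) / len(labels)
print(round(threshold), labels, base_rate)
```

Cutting at a quantile fixes the base rate near 10% regardless of how extreme the tail is, which is why it is a more stable target than a dollar threshold on Pareto-shaped spend.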

Step 5 — Exhaustive feature subset search

Test every combination of 1 to 7 features drawn from the 11 test-side metrics (after adding install_share). Optimize for 5-fold cross-validated AUC to avoid overfitting. Report the smallest combination that matches the full model, since parsimony matters for dashboard integration.
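
The search loop can be sketched as below. Everything model-related is abstracted behind `score_fn`, a hypothetical callable standing in for "fit the model on this subset and return its 5-fold CV AUC"; the feature names, the toy scoring rule, and the 0.005 tolerance for "matches the best" are assumptions for illustration.

```python
from itertools import combinations

def search_subsets(features, score_fn, max_k=7, tolerance=0.005):
    """Score every subset of size 1..max_k.
    Returns the best (subset, score) per size, the smallest size whose best
    score is within `tolerance` of the overall best, and the number of subsets scored."""
    best_by_k = {}
    n_evaluated = 0
    for k in range(1, max_k + 1):
        for subset in combinations(features, k):
            n_evaluated += 1
            score = score_fn(subset)
            if k not in best_by_k or score > best_by_k[k][1]:
                best_by_k[k] = (subset, score)
    top = max(score for _, score in best_by_k.values())
    smallest_k = min(k for k, (_, s) in best_by_k.items() if top - s <= tolerance)
    return best_by_k, smallest_k, n_evaluated

# Placeholder names for the 11 test-side metrics.
features = [f"f{i}" for i in range(11)]
useful = {"f0", "f1", "f2"}

def toy_score(subset):
    # Stand-in for cross-validated AUC: only three features carry signal.
    return 0.5 + 0.08 * len(useful & set(subset))

best_by_k, smallest_k, n_evaluated = search_subsets(features, toy_score)
print(n_evaluated, smallest_k, best_by_k[smallest_k])
```

With 11 candidates and subset sizes 1 through 7 the loop scores 1,815 subsets, so an exhaustive search is cheap enough that no greedy shortcut is needed.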

Step 6 — Test selection-bias hypotheses with date-aware rigor

Compare picked vs. not-picked test-side distributions (Mann-Whitney U). Apply temporal restriction to the concurrent observation window (windows 17–24) where Applovin and test data overlap. Build per-window inventory of top-Lumina creatives that were never propagated to Applovin.

4.7 — Statistical methods used in this report

This subsection is a self-contained reference for the statistical tests cited throughout — readers can return here when they hit a "CV AUC = 0.77 ± 0.08" or "Spearman ρ = +0.12, p = 0.001" in later sections.

1. AUC — Area Under the (ROC) Curve. AUC measures how well a model rank-orders creatives, not whether it classifies them correctly. Formally: AUC is the probability that, given one randomly-chosen "top-decile" creative and one randomly-chosen "not-top-decile" creative, the model assigns a higher score to the top-decile one.

0.50 = random (coin flip)
0.60 = weak signal · 0.70 = useful · 0.80–0.90 = strong · > 0.90 = exceptional
Our best composite achieves AUC = 0.77 ± 0.08 — substantial signal but not decisive on its own.

Why AUC, not accuracy? For our imbalanced outcomes (10% top-decile, 2.5% mega-scaled), a "predict no" baseline gets 90% accuracy while doing zero useful work. AUC measures rank-ordering — exactly what the UAM team does when prioritizing creatives. They don't classify; they prioritize.
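
The pairwise definition of AUC, and the contrast with accuracy on an imbalanced outcome, can be made concrete with a small sketch. The labels and scores are made up; the function is a direct, unoptimized implementation of the definition rather than the routine used to produce the report's numbers.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up cohort with a 10% positive rate.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.70, 0.90, 0.10, 0.20, 0.30, 0.15, 0.40, 0.05, 0.25, 0.35]

# "Predict no for everyone": high accuracy, but no ranking ability at all.
baseline_accuracy = sum(1 for y in labels if y == 0) / len(labels)
baseline_auc = auc(labels, [0.0] * len(labels))
model_auc = auc(labels, scores)
print(baseline_accuracy, baseline_auc, round(model_auc, 3))
```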

2. Cross-validation (5-fold CV). Every AUC we report is cross-validated: the 669-creative dataset is randomly split into 5 roughly equal parts (669 is not divisible by 5, so fold sizes differ by one); the model trains on 4 parts and predicts the held-out 5th; this rotates 5 times so every creative is predicted exactly once by a model that didn't see it during training. The 5 fold-level AUCs are averaged and reported with their standard deviation (e.g., 0.766 ± 0.08). This guards against over-optimistic estimates caused by overfitting and gives a realistic estimate of out-of-sample performance.
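
The fold rotation can be sketched with the standard library alone. This is a plain shuffled split; whether the report's folds were stratified by outcome is not stated, so the absence of stratification here is an assumption, as is the seed.

```python
import random

def kfold_splits(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each index lands in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(kfold_splits(669, k=5))
fold_sizes = sorted(len(test) for _, test in splits)
print(fold_sizes)
```

A model would be fit on each `train_idx` and scored on the matching `test_idx`; the mean and standard deviation of the five fold scores give the "AUC ± sd" format used in this report.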

3. Spearman's rank correlation (ρ) — our primary correlation measure. Measures the strength of a monotonic relationship between two variables based on their ranks, not raw values.

Range: −1 (perfect inverse rank-order) to +1 (perfect concordant rank-order); 0 = no monotonic relationship.
Why we use it as primary: Applovin spend is Pareto-distributed (Gini 0.91) — Pearson on raw values would be dominated by 6 outlier creatives. Many test metrics (IPM, CPI, ROAS) also have heavy right tails. Ranks neutralize this.
Spearman doesn't assume linearity — it just asks "do these two variables move in the same direction across the cohort?"
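
To make the rank-based idea concrete, here is a small dependency-free sketch: Spearman's ρ is computed as Pearson's r on average ranks. The numbers are made up to mimic one extreme spend value; they are not taken from the cohort.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data: perfectly monotonic, but one spend value dwarfs the rest.
metric = [1, 2, 3, 4, 5]
spend = [100, 200, 300, 400, 1_000_000]

rho = spearman(metric, spend)
r = pearson(metric, spend)
print(round(rho, 3), round(r, 3))
```

The rank correlation reports the perfect ordering, while the raw-value correlation is pulled well below 1 by the single outlier.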

4. Pearson's correlation (r) — used alongside Spearman in earlier tables. Measures the strength of a linear relationship on raw values. Reported alongside Spearman as a robustness check; when raw-value Pearson and rank Spearman disagree dramatically, it usually signals outliers. In our data they consistently agreed in direction, with Spearman being more robust.

5. Partial Spearman correlation — the key analysis in Section 8.4.2. Measures the rank correlation between X and Y after removing the linear effect of a third variable Z. In our case: X = test metric, Y = log(Applovin spend), Z = log(test installs).

Computation: (1) rank-transform X, Y, Z; (2) regress X on Z linearly and take residuals; (3) regress Y on Z linearly and take residuals; (4) Pearson-correlate the two residual series. The result isolates the relationship between X and Y that cannot be explained by Z — which is how we separated Lumina D7's true signal (ρ = +0.12, p = 0.001) from its install-volume confound (raw ρ = +0.02).
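
The four computation steps above can be sketched directly. The helper names and the toy vectors are illustrative assumptions; the sketch returns only the coefficient and does not compute a p-value.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residuals(y, x):
    """Residuals of a simple linear regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def partial_spearman(x, y, z):
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)  # step 1
    ex = residuals(rx, rz)                                             # step 2
    ey = residuals(ry, rz)                                             # step 3
    return pearson(ex, ey)                                             # step 4

# Made-up vectors in which x and y both track z closely.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 6, 5, 8, 7]
y = [1, 3, 2, 5, 4, 7, 6, 8]

raw = pearson(average_ranks(x), average_ranks(y))
partial = partial_spearman(x, y, z)
print(round(raw, 3), round(partial, 3))
```

In this toy case the raw rank correlation is positive only because both series follow z; once z is removed the sign flips, which shows how strongly a shared driver can distort a raw correlation in either direction.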

6. Chi-square test of independence — used for categorical crosstabs. Tests whether two categorical variables (e.g., Lumina D7 quartile × propagated/not-propagated, or Lumina D7 quartile × scale tier) are statistically independent. Reported in Section 3.2 (propagation rates by quartile, p = 0.880) and Section 8.6 (quartile × outcome tables).

7. Mann-Whitney U — used for two-group comparisons. Tests whether the rank distributions of two independent groups differ. Like a t-test, but uses ranks instead of means — robust to skewed distributions and outliers. Used in Section 3.1 ("propagated vs not-propagated" median comparisons) and the winners-vs-non comparisons.

8. p-values throughout. All p-values are two-sided; we report them without correction in the main tables, but flag where multiple-testing correction would be relevant (e.g., the 8.4.2 partial-correlation table tests 7 hypotheses — Bonferroni would multiply each p by 7). The headline finding (Lumina D7 partial ρ = +0.12, p = 0.001) survives Bonferroni correction even on the largest set of tests we ran.
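
The Bonferroni arithmetic for the seven partial-correlation tests can be checked with a few lines. The p-values are copied from the Section 8.4.2 table; install_share is reported there only as "<0.0001", so 0.0001 is used as a conservative upper bound, and the 0.05 threshold is an assumption since the report does not fix an alpha.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Partial-correlation p-values from the Section 8.4.2 table (7 hypotheses).
partial_p = {
    "Lumina D7": 0.001,
    "install_share": 0.0001,  # reported as "<0.0001"; upper bound used here
    "CPI": 0.004,
    "IPM": 0.017,
    "ROAS D7": 0.015,
    "ROAS D0": 0.193,
    "CTR": 0.036,
}

ALPHA = 0.05  # assumed significance level
adjusted = bonferroni(partial_p)
survivors = sorted(name for name, p in adjusted.items() if p < ALPHA)
print(adjusted)
print(survivors)
```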

5 · Data Sources & Cleaning

Source	Path	Rows	Notes
Facebook auto-test creatives	`outputs_test_v2/creative_level_analysis.csv`	1,078	998 unique tokens, May 5 2026 build, latest Lumina formula verified to 0.0000 diff
Applovin spend — February	`Downloads/February.csv`	949 ALL_	$950K total — partial visibility
Applovin spend — March	`Downloads/March.csv`	1,095 ALL_	$1.98M total
Applovin spend — April	`Downloads/april.csv`	1,189 ALL_	$5.24M total (55% of all recovered spend)
Applovin spend — May (partial)	`Downloads/may.csv`	1,226 ALL_	$1.40M through May 10

Key cleaning decisions

Decision	Rationale
Token extraction via regex: `ALL_([A-Z]+\d+_V\d+(-\w+)?_[A-Z]+)_VID...`	Matches test-side `s_primary_video_token`. 100% match rate on ALL_ creatives.
Benchmark creatives excluded from main analysis	6 benchmarks accumulate Facebook test data across 17–27 windows each, accounting for 60% of merged Applovin spend ($5.05M). Including them violates "one creative = one test" assumption.
ROAS in % (not dollars)	User-confirmed. A value of 1.49 means 1.49% return on test-phase D7.
453 Applovin-only creatives flagged but excluded	Pre-October-2025 historical winners (`R1426_V1` $480K, `R1608_V1` $391K). Cannot recover their test data without older Facebook auto-test exports.
Strict activity flag: imp>0 AND clicks>0 AND spend>0	Surfaced 27 data anomalies. Loose filters (just spend>0) inflate the "active" population and dilute signal.
Volume-weighted Lumina across mature windows only	Per the Lumina spec, only mature rows have meaningful D7 metrics. Including immature scores in the per-creative average adds noise.
NEW: install_share_in_window computed per (token, window)	Measures how much of the test window's total install volume a creative captured. Volume-weighted across windows for benchmarks. Captures "did this creative attract a large audience relative to its peers."
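
A minimal sketch of the install_share_in_window computation described in the last row above. It assumes one row per (token, window) with an install count, and it assumes that "volume-weighted" means weighting each window's share by the creative's own installs in that window; both are illustrative assumptions, as are the sample rows.

```python
from collections import defaultdict

def install_share_per_window(rows):
    """rows: list of (token, window, installs), one row per (token, window).
    Returns {(token, window): share of that window's total installs}."""
    totals = defaultdict(int)
    for _, window, installs in rows:
        totals[window] += installs
    return {(t, w): i / totals[w] for t, w, i in rows if totals[w] > 0}

def volume_weighted_share(rows):
    """One value per token: per-window shares weighted by the token's installs there."""
    shares = install_share_per_window(rows)
    num, den = defaultdict(float), defaultdict(float)
    for t, w, i in rows:
        if (t, w) in shares:
            num[t] += shares[(t, w)] * i
            den[t] += i
    return {t: num[t] / den[t] for t in num if den[t] > 0}

# Made-up rows: creative "A" appears in two windows, like a benchmark would.
rows = [("A", 1, 50), ("B", 1, 150), ("A", 2, 100), ("C", 2, 100)]
per_window = install_share_per_window(rows)
per_token = volume_weighted_share(rows)
print(per_window)
print(per_token)
```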

The strict reliability funnel

Starting from 1,425 unique tokens (test ∪ Applovin), we filter down to a 669-creative cohort that meets every reliability criterion. Every analysis in the results section uses this cohort.

6 · Outcome Distribution & Why It Shaped the Method

Statistic	Value	Interpretation
Median spend	$502	Half of reliable creatives spend less than this
Mean spend	$4,777	9.5× the median → highly skewed
p90 threshold	$2,820	Used as the "top decile" outcome cutoff
Gini coefficient	0.912	More unequal than the income distribution of any country (national income Gini coefficients run from roughly 0.25 to 0.65)
Top 1% creatives	74% of spend	6 creatives carry $2.35M of $3.20M total
Bottom 50% creatives	2.1% of spend	334 creatives split $69K
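
The concentration statistics in the table above can be computed along these lines. The sketch uses the standard rank-weighted Gini formula for sorted non-negative values; the toy inputs are made up and serve only to show the two extremes.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

def top_share(values, fraction):
    """Share of the total held by the top `fraction` of items (at least one item)."""
    xs = sorted(values, reverse=True)
    k = max(1, int(len(xs) * fraction))
    return sum(xs[:k]) / sum(xs)

equal = [1, 1, 1, 1]          # everyone spends the same
concentrated = [0, 0, 0, 1]   # one creative holds everything

print(gini(equal), gini(concentrated), top_share(concentrated, 0.25))

# With n = 669, "top 1%" resolves to 6 creatives under this rounding rule.
top_1pct_count = max(1, int(669 * 0.01))
print(top_1pct_count)
```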

Tier composition of the reliable cohort (n=669) — corrected presentation

The previous version of this chart used dual y-axes which made it hard to read the actual counts. Below the two views are split apart for clarity.

Tier	n creatives	% of n	Spend	% of spend
MEGA $100k+	4	0.6%	$2,180,286	68.2%
SCALED $10k-100k	13	1.9%	$413,973	13.0%
PROVEN $1k-10k	177	26.5%	$434,492	13.6%
LIGHT $100-1k	366	54.7%	$161,820	5.1%
NOISE <$100	109	16.3%	$5,490	0.2%

The four MEGA creatives ($100K+) account for 68% of all reliable-cohort spend; the 13 SCALED ($10K–100K) add another 13%. Together, 17 creatives carry 81% of the reliable-cohort spend; the remaining 652 share $602K (19%). This is why any "what makes a winner" question has a small-sample problem.

7 · Tests Conducted & Why Each One

The analysis ran 13 approaches. Each was motivated by a specific question that arose from the previous step. The honest progression — including dead-ends — is documented here.

#	Approach	Question it answers	Outcome	Status
1	Naive pairwise correlations	Does any test metric correlate with Applovin spend?	All |ρ| < 0.15. Most metrics show inverted signs.	Confounded by benchmarks
2	Benchmark-aware aggregation + scale-only audit	Are 6 benchmark creatives skewing the result?	Yes — 60% of merged Applovin spend. Excluded.	Foundational cleanup
3	Log-spaced industry tiers	Do Lumina scores differ across spend tiers?	Visual lift in top 2 tiers (n=17) but stats insignificant.	Sample too small for stats
4	Monthly panel — durability outcomes	Does the test predict sustained spend (months active)?	Retention/ROAS push correct direction for sustainability, wrong for peak.	Suggests 2-axis framing
5	Regularized multivariate model on full cohort	Combined, do the metrics predict scale?	AUC 0.69. Lumina D0 dominant; components inverted.	Strongest aggregate result
6	Two-stage P(picked) × P(scales|picked)	Does separating "got selected" from "scaled given selected" add lift?	No. AUC 0.67 combined vs 0.68 single.	Dead end
7	Lumina quartile × scale tier crosstab	Is Lumina D7 quartile linked to scale tier?	Chi-square p=0.81 for D7. Independent of scale tier.	Definitive D7 univariate null
8	Install-volume stratification	Does Lumina D7 work for higher-install (less noisy) creatives?	Signal improves from ρ=0 to ρ=+0.09 but never significant.	Median installs too low
9	$10K+ winners inspection	What's different about the winners?	Winners have shorter playtime (p=0.0008) and higher ROAS-growth.	Playtime inversion
10	ROAS-growth hypothesis validation	Does low-D0 + high-ROAS-growth predict scale?	No — candidates scaled at 18.8% vs control 25.8%.	Hypothesis rejected
11	Availability filter (Applovin-active only)	Are high-Lumina creatives just not getting impressions?	No — Q4 D7 has 85% active rate.	Confirmed not artifact
12	Strict reliability cohort + 4-feature composite	What's the best parsimonious model under maximum rigor?	4-feature composite reaches AUC 0.72.	First strong result
13	Install-share + selection-bias proof	Does audience reach help, and is selection bias real?	AUC 0.77 with install_share added. Selection bias confirmed empirically.	FINAL ANSWER

8 · Statistical Results

8.1 — Univariate predictive power (single metric → top decile)

How well does each test metric, used alone, rank creatives by their probability of reaching the top decile of Applovin spend? AUC = 0.5 means random; AUC = 1.0 means perfect.

Key observations:

install_share is the strongest single feature (AUC = 0.65) — creatives that captured a larger share of installs in their test window are more likely to top-decile-scale on Applovin.
log(ROAS D0) and log(ROAS D7) tie next at AUC ≈ 0.63 — but with inverted sign (see Section 8.4 for explanation).
Retention metrics cluster around AUC 0.55–0.59 — directional but inconsistent.
Lumina D7 used alone sits at AUC = 0.50 on the propagated cohort — but this is an observational ceiling shaped by the operational filters that select which creatives reach Applovin (Section 3). It is not a verdict that Lumina is uninformative; in the multivariate model Lumina D7 receives the largest positive coefficient (Section 8.5).
The next-weakest single feature is CTR % (AUC = 0.54). Several other operational signals are also weak in isolation but complementary when combined.
Lumina D0 is also weak alone (AUC = 0.51), but combined with the other features it adds multivariate value.

8.2 — Multivariate AUC progression with install_share included

The 5-feature composite reaches AUC 0.77. Adding features beyond 5 yields no statistically meaningful AUC improvement (gains < 0.005, within cross-validation noise). install_share is in every top-k subset from k=1 to k=7 — it's the most stable contributor.

8.3 — Tested alternative composites (incl. user-suggested maturation forms)

Composite formulation	Univariate or k-feature AUC	Verdict
Lumina D7 alone	0.46	Worse than random
(Lumina D0 + D7) / 2	0.42	Worse — correlated metrics cancel signal
D7 / D0 (maturation ratio)	0.48	Doesn't beat its components
D7 − D0 (additive maturation)	0.51	Better but still random
(D7 − D0) × install_share	0.54	Marginal improvement
(D7 / D0) × install_share (user hypothesis)	0.65	Strong! Driven by install_share term
install_share alone	0.65	Best single feature
4-feat (D0 + D7 + IPM + ROAS_D7)	0.72	Previous best
5-feat (+ install_share)	0.77	NEW best — recommended

The maturation hypothesis is partially validated: the (D7/D0) ratio alone is uninformative (AUC 0.48), but (D7/D0) × install_share matches the best single feature (AUC 0.65). The lift comes mostly from install_share — the maturation ratio adds little on its own but provides a stable rank-based anchor.

8.4 — The install-volume effect: why Lumina D7's true signal is masked

Test-design context (important): Each creative in the auto-test is given the same daily budget envelope (≈$15–$20/day for ~4 days, so ~$60–$80 total per creative). The competition is fair on dollars. What varies is how Facebook's algorithm spends that budget for each creative — which audiences it shows the ad to, at what CPM, with what CTR and conversion rate. A creative whose impressions convert cheaply mechanically accumulates more installs from the same dollars. Install volume in our test is therefore a measure of conversion efficiency at constant spend, not a measure of how much budget the algorithm gave the creative.

Precision note on the Lumina formula: Throughout this section we abbreviate the Lumina components as z(IPM), z(ROAS_D7), etc. The actual implementation applies log(metric + 1) before z-scoring within the window — i.e. each component is z(log(metric + 1)). The log(·+1) transform compresses the heavy right tail of these distributions (a few creatives with very high ROAS or IPM would otherwise dominate the z-score) and handles zero values cleanly (a creative with ROAS_D7 = 0 becomes log(1) = 0 instead of log(0) = −∞). This log step does not change the direction of the install-volume effect described below — z-scoring still removes the absolute-level information, and Lumina's components are still monotonic in their raw metrics. We use the shortened z(metric) notation in the prose for readability.
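
The per-component transform described in this note, z(log(metric + 1)) within a window, can be sketched as below. The use of the population standard deviation, the zero-variance fallback, and the sample values are assumptions; the report does not specify them.

```python
import math
from statistics import mean, pstdev

def z_log1p(values):
    """z-score of log(value + 1) across one test window."""
    logged = [math.log1p(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    if sigma == 0:
        # All creatives identical on this metric: no spread to standardize.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in logged]

# Made-up ROAS_D7 values for one window, including a zero and a heavy-tail value.
roas_d7 = [0.0, 0.64, 1.31, 2.50, 40.0]
z = z_log1p(roas_d7)
print([round(v, 2) for v in z])
```

The zero maps to log(1) = 0 before standardizing, and the ordering of creatives is unchanged because both the log and the z-score are monotonic transforms.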

This section explains the most consequential mechanism we found: Lumina D7 has real predictive power for Applovin scaling, but that power is partially obscured because Lumina's components and absolute install volume are measuring overlapping economic quality. Understanding this is what makes the difference between a model that hits AUC 0.77 by leveraging the overlap and a model that reaches a somewhat lower AUC (0.65–0.70) with coefficients you can defend operationally.

8.4.1 — Why Lumina's components and install volume are not independent

Lumina rewards high-IPM creatives (more installs per impression, a real efficiency signal) and high-ROAS-D7 creatives (long-maturation paying users, a real monetization signal). Both of these intuitions are correct. The complication is that, under our fixed-budget auto-test design, install count and IPM are essentially the same axis: a creative with high IPM mechanically accumulates more installs at the same total spend. The same holds for CPI: a creative with cheaper installs gets more of them from the same budget. As a result, the "different" Lumina components are not measuring independent dimensions of quality on this dataset: IPM, CPI, and ROAS_D7 measure the same underlying conversion-efficiency signal from highly correlated angles, and only retention is uncorrelated with volume (see the table below).

Test metric	Spearman ρ with i_installs	What this tells us
CPI	−0.986	Near-mechanical inverse: at fixed spend, more installs means lower CPI by definition
IPM	+0.835	Strongly co-moves: at fixed budget, high-IPM creatives mechanically generate more installs
Lumina D7	+0.596	Lumina rewards IPM/ROAS, so it inherits this volume correlation
ROAS D7	+0.460	Big test samples → more stable D7 ROAS estimates
ROAS D0	+0.383	Same mechanism, smaller magnitude
CTR	−0.045	Independent of install volume
Retention D1 / D7	≈ 0	Pure quality signal — uncorrelated with volume

The bucketed view makes the mechanism concrete. Across 6 install-volume buckets in the reliable cohort:

Install bucket	n	Median installs	Median IPM	Median CPI ($)	Median ROAS D7 (%)	Median Lumina D7
<30 installs	66	24	0.69	$1.24	0.64	20.6
30–50	116	41	1.10	$0.72	1.04	26.9
50–75	170	64	1.45	$0.46	1.31	35.3
75–100	158	87	1.82	$0.33	1.80	47.7
100–150	129	117	2.36	$0.25	2.08	58.9
150+	30	173	3.27	$0.17	2.50	66.8

The gradient from <30 installs to 150+ shows median IPM rising from 0.69 to 3.27 (about 4.7×), median CPI falling from $1.24 to $0.17 (about 7×), and median Lumina D7 climbing from 20.6 to 66.8 in lockstep. Within a single test cohort, the operational metrics and the composite that uses them are largely co-determined by test-phase install volume.
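A quick arithmetic check makes the fixed-budget mechanism visible in the bucket table itself. The sketch below multiplies each bucket's median installs by its median CPI. The product of two medians is only a rough proxy for per-creative spend (it is not recomputed from the cohort CSV), so treat the result as illustrative.

```python
# Bucket medians copied from the table above: (label, median installs, median CPI in $).
buckets = [
    ("<30", 24, 1.24),
    ("30-50", 41, 0.72),
    ("50-75", 64, 0.46),
    ("75-100", 87, 0.33),
    ("100-150", 117, 0.25),
    ("150+", 173, 0.17),
]

# Rough proxy for test spend per creative: installs x cost per install.
implied_spend = {label: installs * cpi for label, installs, cpi in buckets}

for label, spend in implied_spend.items():
    print(f"{label:>8}: ~${spend:.2f}")

spread = max(implied_spend.values()) - min(implied_spend.values())
print(f"spread across buckets: ${spread:.2f}")
```

The implied spend stays within roughly a dollar across all six buckets while installs vary about sevenfold, which is what "install count and CPI are the same axis at fixed budget" looks like numerically.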

8.4.2 — Partial correlations: Lumina D7 emerges as the cleanest signal once installs are controlled for

To separate volume-driven contribution from quality-driven contribution, we compute partial Spearman correlations between each test metric and log(Applovin spend), controlling for log(test installs):

Metric	Raw ρ vs spend	Partial ρ (controlling log installs)	p (partial)	Reading
Lumina D7	+0.022 (p=0.56)	+0.121	0.001 ✓	Emerges as the strongest clean signal
install_share	+0.165	+0.296	<0.0001 ✓	Becomes nearly twice as strong
CPI	+0.141	+0.112	0.004	Inversion shrinks
IPM	−0.154	−0.092	0.017	Inversion shrinks
ROAS D7	−0.140	−0.094	0.015	Inversion shrinks
ROAS D0	−0.094	−0.050	0.193	No longer significant
CTR	−0.075	−0.081	0.036	Stable, small

What this tells us about Lumina D7: the raw correlation between Lumina D7 and Applovin spend looks like nothing (ρ = +0.02). That near-zero value is a suppression effect: test install volume is entangled with both Lumina D7 and Applovin spend, and it masks the relationship between them until installs are held fixed. The partial correlation rises to ρ = +0.12 (p = 0.001), the largest partial correlation among the test-quality metrics in the table (install_share, a reach measure rather than a quality metric, is larger at +0.30). Lumina D7 is doing real work; the install confound was hiding it.
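One standard way to compute a partial Spearman correlation is to rank-transform all three variables and apply the first-order partial correlation formula. The report does not state which implementation produced the table, so the sketch below is an assumption about the approach, and the data is a synthetic toy grid (not report data) built so that the control variable pushes the metric up and the outcome down.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with the rank-linear effect of z removed."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic toy grid: every combination of install level and quality level.
installs = [z for z in range(1, 5) for q in range(1, 5)]
quality = [q for z in range(1, 5) for q in range(1, 5)]
metric = [z + q for z, q in zip(installs, quality)]   # metric rises with installs and quality
spend = [q - z for z, q in zip(installs, quality)]    # toy outcome: quality up, installs down

raw_rho = spearman(metric, spend)
partial_rho = partial_spearman(metric, spend, installs)
print(f"raw rho = {raw_rho:+.3f}, partial rho = {partial_rho:+.3f}")
```

In this toy setup the raw correlation is zero while the partial correlation is strongly positive, the same qualitative pattern as the Lumina D7 row. Significance testing (the p column) is not covered by this sketch.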

8.4.3 — What the negative IPM and ROAS_D7 coefficients are really doing

The negative coefficients on log(IPM) and log(ROAS D7) in the 5-feature composite are not a claim that low-IPM creatives scale better. They are the model performing an algebraic decomposition: extracting the parts of Lumina D7 that come from other components (retention, CPI residual) by subtracting out the parts that come from IPM and ROAS_D7.

Recall the Lumina D7 raw formula, expanded with its actual transforms:

raw_d7 = 1.2·z(log(IPM+1)) + 1.6·z(log(ROAS_D7+1)) + 1.0·z(log(RET_D7+1)) − 1.0·z(log(CPI+1))

The composite uses log(IPM + 1) and log(ROAS_D7 + 1) directly — i.e. the same transform Lumina applies internally, minus the z-score step. That is exactly the point: z(log(metric + 1)) = (log(metric + 1) − μ_window) / σ_window, so the un-z-scored log(metric + 1) carries the absolute-level information that the z-score step removes. When the composite has access to both Lumina D7 and the raw log-transforms, the model can decompose:

β·Lumina_D7 − γ·log(IPM+1) − δ·log(ROAS_D7+1) ≈ retention + CPI parts of Lumina + cohort-rank component

That is: the negative IPM/ROAS coefficients remove the IPM and ROAS pieces from Lumina so that the surviving signal is the retention and CPI residual — plus the pure cohort-rank information that Lumina's z-scoring preserves. Under our fixed-budget test design, log(IPM + 1) carries the absolute conversion-efficiency level, log(ROAS_D7 + 1) carries the absolute payback level, and Lumina D7 carries those plus retention and CPI plus the within-window rank. Subtracting the first two from Lumina isolates the rest. The composite isn't reaching for a different metric than Lumina uses — it's giving the model access to the un-normalized version of the same metric, which carries the level information that the z-score step strips out.

This is verifiable by tracking how the IPM coefficient changes as we add features:

Composite	IPM coefficient	ROAS D7 coefficient	Interpretation
IPM + ROAS_D7 only	−0.06	−0.40	Small marginal effect when alone
+ both Luminas	−0.51	−0.92	Model uses IPM/ROAS to extract non-IPM/non-ROAS parts of Lumina
+ install_share	−0.71	−0.80	Both IPM and share now carry the absolute-level signal
+ log_installs explicitly	−0.18	−0.32	Once installs is its own variable, IPM/ROAS coefficients shrink — confirming they were carrying the install-volume signal

The shrinkage of the IPM coefficient from −0.71 to −0.18 when log(installs) is added explicitly is the cleanest evidence of the mechanism. The IPM coefficient was carrying the absolute-volume signal because nothing else could; once log(installs) is given the job directly, IPM keeps only its smaller marginal effect.
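The algebra behind the decomposition in this subsection can be checked numerically. Because z(log(m + 1)) is a linear function of log(m + 1) within a window, raw_d7 equals a weighted sum of the un-normalized logs minus a constant that depends only on the window. The sketch below verifies that identity. The weights come from the raw_d7 formula above; the four-creative window is illustrative (IPM, ROAS_D7 and CPI borrow bucket medians, retention values are invented), and the population standard deviation is an assumption.

```python
import math
import statistics

# Weights and signs from the raw_d7 formula in this section.
WEIGHTS = {"ipm": 1.2, "roas_d7": 1.6, "ret_d7": 1.0, "cpi": -1.0}

# Illustrative window; retention values are invented for the example.
window = [
    {"ipm": 0.69, "roas_d7": 0.64, "ret_d7": 0.03, "cpi": 1.24},
    {"ipm": 1.45, "roas_d7": 1.31, "ret_d7": 0.05, "cpi": 0.46},
    {"ipm": 2.36, "roas_d7": 2.08, "ret_d7": 0.04, "cpi": 0.25},
    {"ipm": 3.27, "roas_d7": 2.50, "ret_d7": 0.06, "cpi": 0.17},
]

def window_stats(rows):
    """Per-metric (mean, std) of log(metric + 1) inside the window."""
    stats = {}
    for name in WEIGHTS:
        logged = [math.log1p(r[name]) for r in rows]
        stats[name] = (statistics.fmean(logged), statistics.pstdev(logged))
    return stats

def raw_d7_zscore(row, stats):
    """raw_d7 as written: weighted sum of within-window z-scores."""
    return sum(w * (math.log1p(row[k]) - stats[k][0]) / stats[k][1]
               for k, w in WEIGHTS.items())

def raw_d7_expanded(row, stats):
    """Same quantity as (weighted sum of raw logs) minus (window-only constant)."""
    level_part = sum(w / stats[k][1] * math.log1p(row[k]) for k, w in WEIGHTS.items())
    window_constant = sum(w * stats[k][0] / stats[k][1] for k, w in WEIGHTS.items())
    return level_part - window_constant

stats = window_stats(window)
pairs = [(raw_d7_zscore(r, stats), raw_d7_expanded(r, stats)) for r in window]
for a, b in pairs:
    print(f"z-score form = {a:+.4f}   expanded form = {b:+.4f}")
```

The window constant is what differs from window to window and is discarded by the z-score. Supplying log(IPM + 1) and log(ROAS_D7 + 1) as separate features gives a model access to the level information that this constant removes.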

For the UAM team: the negative IPM/ROAS coefficients in Composite A do not mean "pick low-IPM creatives" or "pick low-ROAS creatives." Higher IPM and higher ROAS D7 in test remain operationally good — they're indicators of better conversion efficiency and payback. The negative signs are the model doing algebra inside the composite to isolate the retention and within-cohort rank signals that Lumina also contains. If this is confusing, use Composite B (Section 9.2) instead — it has all-positive coefficients on the quality features and only one negative coefficient on log(installs), which is an explicit statistical control with a clear interpretation.

8.5 — Standardized coefficients of the recommended composite

8.6 — Lumina quartile rates by outcome

If Lumina were a strong predictor on its own, top-decile rates would rise monotonically Q1→Q4. The data shows weak, non-monotonic patterns — consistent with the selection-bias evidence in Section 3.

Test	Chi-square	p-value	Verdict
Lumina D7 quartile × top-decile	3.60	0.31	Independent (no signal)
Lumina D7 quartile × mega ≥$10K	2.28	0.52	Independent (no signal)
Lumina D0 quartile × sustainability (≥3mo)	7.16	0.067	Marginal — directional only
Lumina D0 quartile × top-decile	2.92	0.40	Independent (no signal)
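The p-values in this table can be reproduced from the reported chi-square statistics. The sketch below assumes 3 degrees of freedom (4 quartiles × a binary outcome), which is an inference from the table layout rather than something stated in the report. It uses the closed-form survival function for that case so that no external library is needed, and includes a generic Pearson chi-square helper for a contingency table.

```python
import math

def chi2_sf_df3(stat):
    """P(X >= stat) for a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))

def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Chi-square statistics as reported in the table above.
reported = {
    "D7 quartile x top-decile": 3.60,
    "D7 quartile x mega": 2.28,
    "D0 quartile x sustainability": 7.16,
    "D0 quartile x top-decile": 2.92,
}
p_values = {name: chi2_sf_df3(stat) for name, stat in reported.items()}
for name, p in p_values.items():
    print(f"{name:<30} p = {p:.3f}")
```

To recompute a statistic from counts, pass the 4×2 crosstab (quartile rows, outcome columns) to `chi2_statistic`; the crosstab CSVs listed in the appendix would be the natural input.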

9 · The Lumina Cross-Network Scale Score — Two Recommended Versions

Given the install-volume mechanism described in Section 8.4, we present two recommended composites. They make different trade-offs between predictive performance and coefficient interpretability — both are deployable, the choice depends on the downstream use case.

9.1 — Composite A: High-AUC version (recommended for ranking)

Formula (CV AUC = 0.766 ± 0.08 on n=635):

P(top-decile Applovin scale) = sigmoid( β₀ + β₁·LuminaD0 + β₂·LuminaD7 + β₃·log(IPM) + β₄·log(ROAS_D7_pct) + β₅·install_share_in_window )

Feature	Type	Standardized coefficient	Direction
Lumina D0 (mature)	Per-window normalized 0–100	+0.023	Higher → more likely top-decile (weak in this composite)
Lumina D7 (mature)	Per-window normalized 0–100	+0.736	Strongest positive contributor
log(IPM)	Raw test-phase level	−0.708	Negative — see Section 8.4 (proxies install-volume confound)
log(ROAS D7 %)	Raw test-phase level	−0.798	Negative — see Section 8.4 (proxies install-volume confound)
install_share_in_window	Window-relative audience reach	+0.628	Strong positive contributor

Use this version when: you need maximum top-decile ranking accuracy and the coefficients will not be shown to stakeholders for direct interpretation. The negative coefficients on IPM and ROAS_D7 are statistical corrections (see Section 8.4) but can be confusing.
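As a minimal sketch of how Composite A could be applied for ranking, the function below combines the standardized coefficients from the table with already-standardized feature values. Two things are assumptions: the intercept β₀ is not reported in this section, so a placeholder of 0.0 is used (the output is therefore a ranking score, not a calibrated probability), and the feature key names are hypothetical labels chosen for the example.

```python
import math

# Standardized coefficients from the Composite A table above.
COEFS = {
    "lumina_d0": 0.023,
    "lumina_d7": 0.736,
    "log_ipm": -0.708,
    "log_roas_d7_pct": -0.798,
    "install_share_in_window": 0.628,
}
INTERCEPT = 0.0  # placeholder: beta_0 is not given in this section

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def composite_a_score(standardized_features, intercept=INTERCEPT):
    """Score one creative; inputs must be standardized with the training cohort's mean and std."""
    linear = intercept + sum(COEFS[name] * standardized_features[name] for name in COEFS)
    return sigmoid(linear)

# Two hypothetical creatives with identical raw levels but different Lumina D7.
baseline = {name: 0.0 for name in COEFS}
strong_d7 = dict(baseline, lumina_d7=1.0)   # one standard deviation above the cohort mean
weak_d7 = dict(baseline, lumina_d7=-1.0)    # one standard deviation below

ranked = sorted(
    [("strong_d7", strong_d7), ("baseline", baseline), ("weak_d7", weak_d7)],
    key=lambda item: composite_a_score(item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Because the sigmoid is monotonic, the missing intercept does not change the ordering of creatives, only the absolute score values.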

9.2 — Composite B: Clean-causal version (recommended for stakeholder explanation)

Formula (CV AUC = 0.701 ± 0.06 on n=669):

P(top-decile Applovin scale) = sigmoid( β₀ + β₁·LuminaD0 + β₂·LuminaD7 + β₃·install_share_in_window + β₄·log(test_installs) )

Feature	Direction	Operational interpretation
Lumina D0 (mature)	+ positive	Reward strong early signal
Lumina D7 (mature)	+ positive	Reward strong matured signal
install_share_in_window	+ positive	Reward audience-fit on Facebook
log(test_installs)	− negative (control)	Statistical control — at fixed test budget, install count carries the absolute conversion-efficiency level. Including it lets the other coefficients carry only quality signal independent of volume.

Use this version when: you need to explain to non-data stakeholders why a creative scored high. All quality signals carry positive coefficients in the natural direction; the only negative coefficient is the install-volume control, which has a clear statistical justification ("we already account for the fact that high-install creatives looked artificially good on Facebook").

9.3 — Comparison of all considered composites

Composite	AUC	Coefficient story
Lumina D7 alone (univariate)	0.50	Hidden by install-volume confound (Section 8.4)
D0 + D7 + install_share	0.65	All positive coefficients; clean story but lower AUC
Composite B (clean-causal)	0.70	All positive on quality; one explicit volume control
Composite A (high-AUC)	0.77	Best ranking; uses negative IPM/ROAS coefs to absorb the confound
Composite A + log_installs (6-feat)	0.76	Marginal — explicit control redundant with implicit absorption
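For readers less familiar with the metric in this table: AUC is the probability that a randomly chosen top-decile creative receives a higher score than a randomly chosen non-top-decile creative, so 0.50 is chance. The sketch below is a small pairwise implementation of that definition. It is not the cross-validation pipeline behind the reported numbers, and the labels and scores in the example are made up.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = reached top-decile Applovin spend, 0 = did not.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
example_auc = roc_auc(labels, scores)
print(f"AUC = {example_auc:.2f}")
```

With a positive class as small as the one in this analysis, a handful of pair flips moves the value noticeably, which is why the fold-to-fold standard deviation is reported alongside each AUC.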

9.4 — Why the volume control matters

The key insight from Section 8.4: Lumina D7 has real predictive power that is largely hidden by the install-volume confound. The two composites above handle this differently:

Composite A uses the negative coefficients on log(IPM) and log(ROAS D7) to implicitly subtract the install-volume contribution. The model is more accurate, but the negative signs invite misreading.
Composite B uses explicit log(test_installs) as a control variable. The remaining coefficients on Lumina and install_share carry only the residual quality signal, with the natural positive sign throughout. The cost is about 7 AUC points (0.77 → 0.70) in exchange for a large interpretability gain.

Recommendation: publish Composite B (clean-causal) as the dashboard-visible Lumina Cross-Network Scale Score for the UAM team to consult. Keep Composite A (high-AUC) as the back-end ranker for any automated prioritization. Both keep Lumina D0/D7 unchanged for within-cohort Facebook ranking — the new scores answer the different question: "given this creative's test result, how likely is it to top-decile-scale on Applovin?"

9.5 — What does NOT work (rejected alternatives)

Alternative composite	AUC	Why not
(Lumina D0 + Lumina D7) / 2	0.42	Too correlated; averaging loses signal
Lumina D7 / Lumina D0 (maturation ratio)	0.48	Ratio of correlated metrics has no marginal info
Lumina D7 − Lumina D0 (additive maturation)	0.51	Marginally better; still near random
(Lumina D7 / D0) × install_share	0.65	Lift comes entirely from install_share term
(Lumina D7 − D0) × install_share	0.54	The subtraction loses information
Heavy-re-weighted Lumina (double IPM/ROAS weights)	0.45	Re-weighting alone doesn't fix the normalization problem

10 · Findings Ranked by Reliability

Tier A · Statistically robust (publishable)

Lumina D7 has real predictive power for Applovin scaling, made visible by controlling for test-phase install volume. Raw correlation ρ = +0.02 (hidden by confound); partial correlation after controlling for log(installs) = +0.121 (p = 0.001), the strongest clean signal among the test-quality metrics (install_share, covered below, is stronger still).
The install-volume mechanism is documented and reproducible. Test-phase IPM (ρ = +0.83), ROAS_D7 (ρ = +0.46), and Lumina D7 itself (ρ = +0.60) all co-vary with i_installs. The mechanism: under our fixed-budget auto-test design (~$15–$20/day per creative for ~4 days), install count is a direct function of conversion efficiency at constant spend — the same economic quality Lumina also captures via IPM, CPI, and ROAS_D7. Once this volume signal is controlled for, the residual signal is overwhelmingly carried by Lumina D7 (cohort-rank component) and install_share.
Two recommended composites:
- Composite A (high-AUC): Lumina D0 + Lumina D7 + log(IPM) + log(ROAS_D7) + install_share — CV AUC = 0.77. Best for ranking.
- Composite B (clean-causal): Lumina D0 + Lumina D7 + install_share + log(installs) — CV AUC = 0.70. Best for stakeholder explanation; all quality coefficients positive.
install_share_in_window is the strongest single observed predictor (univariate AUC = 0.65). After controlling for installs, its partial correlation rises to ρ = +0.30 (p < 0.0001).
Operational filters shape the observed cohort. Creatives that reached Applovin have lower CPI (p=0.020), higher CTR (p=0.007), and higher install volume (p=0.019) than those that did not — but their Lumina percentile-in-window is statistically indistinguishable (p=0.40–0.99). The univariate Lumina AUC of 0.50 is an observational ceiling shaped by this filtering, not a verdict that Lumina is uninformative.
Within-window normalization costs ~9 AUC points cross-platform. The same model reaches AUC 0.71 on raw features and 0.62 on within-window z-scored features. This is consistent with the design intent of Lumina (cohort-fair Facebook ranking) and motivates the composite approach rather than a Lumina re-weighting.
Top-decile scaling and sustainability are distinct signals. Top-decile (AUC 0.77) driven by Lumina + install_share + raw-economic components. Sustainability over 4 months (AUC 0.59) driven by Retention D1.
Playtime inversion: top-decile winners have shorter test-phase playtime (Mann-Whitney p = 0.0008 after duration normalization).

Tier B · Suggestive (worth follow-up, needs more data)

Lumina D0 → sustainability (chi-square p = 0.067). Q4 D0 has 67.7% sustained rate vs Q2 D0 = 55.1%.
(D7/D0) × install_share variant (user hypothesis) reaches AUC 0.65 univariately — comparable to install_share alone. The maturation ratio adds little marginal value here but provides a rank-stable anchor; worth retesting once test sample sizes grow.
Q3 Lumina D7 spend concentration: $1.39M (44% of active spend) flows through mid-range D7 creatives, not top-quartile. Driven by 2–3 mega-winners; worth qualitative investigation.

Tier C · Investigated and rejected (negative findings worth recording)

ROAS-growth ratio (D7/D0) hypothesis. Validation: candidates scaled at 18.8% vs control 25.8%. Ratio unstable when D0 ROAS is near zero.
Two-stage P(picked) × P(scales|picked) decomposition. Did not add lift over single-stage.
Naive (D0+D7)/2 composite. AUC 0.42–0.52 — worse than either alone.
Mega-scale (≥$10K) standalone prediction. AUC 0.46 with n=17 positives. Too few to model reliably.
Re-weighting Lumina component weights. Doubling IPM/ROAS_D7 weights moves AUC from 0.43 to 0.45 — within noise.
Lumina D7 maturation ratio (D7/D0) alone. AUC 0.48 — uninformative.

11 · Known Limitations & Caveats

Limitation	Impact	Mitigation path
Applovin spend granularity capped	Recovered data is limited to Feb–May 2026 monthly snapshots. Pre-Feb 2026 spend is folded into "lifetime cumulative" totals; we cannot reconstruct earlier monthly slices.	Push for Applovin API access at the creative level for daily/weekly granularity.
Median 72 installs per test creative	D7 ROAS and retention dominated by noise; ratios unstable	Increase test budget per creative; require ≥100 installs for D7-eligibility
17 winners ≥$10K (n_positives too small)	Mega-tier coefficients are unstable; AUC 0.46 with ±0.10 CV variation	Wait 2–3 months for more Applovin scale data; pull pre-October-2025 test data
453 scale-only creatives	~12% of ALL_ Applovin spend invisible to test side	Pull pre-October-2025 Facebook auto-test data when available
Lumina rescaled per-window	Discards absolute level info that predicts cross-platform scale	Add complementary raw-level Lumina Cross-Network Scale Score (this analysis's main recommendation)
Single game / single platform	Findings may not generalize to other games or networks	Replicate on second game once pipeline matures
UAM team picks informally — process evolves over time	Selection mechanism is partially unobserved and changes month-to-month	Ask UAM team to log their picks with reasons; treat as labeled dataset
Pareto-distributed outcome (Gini 0.91)	Quantile buckets are meaningless; require continuous or top-decile binary	Stick with continuous log(spend) or top-decile outcomes
Cross-validation variance ±0.08–0.10	Individual AUC values have wide CI on small positive class	Report ± std; don't over-interpret 0.01 differences

12 · Actionable Recommendations

Now (no new data needed)

Add the Lumina Cross-Network Scale Score to the dashboard alongside Lumina D7. Use Composite B (Section 9.2) as the dashboard-visible score and Composite A, the 5-feature composite (Section 9.1), as the back-end ranker. Keep Lumina D7 unchanged for its original purpose (within-cohort Facebook ranking).
Brief the UAM team on the selection-bias finding. They've been using CPI + CTR + raw installs — that's not wrong, but it's not Lumina either. The new composite formalizes a signal closer to what's actually predictive.
Build a sustainability score separately. Retention D1 + ROAS predicts which creatives stay active 4+ months (AUC 0.59). This is a different use case from asking whether a creative will peak-scale.
Investigate the playtime inversion qualitatively. Look at the 17 winners' creative content — what's similar about them?

Next 1–3 months (data accumulation)

Push for finer Applovin spend granularity via API access (the monthly export limit is the binding constraint right now).
Increase test installs per creative — target 200+.
Validate the 5-feature composite prospectively. Score the next 50 test creatives, see if predicted top-deciles actually scale.
Log UAM picks with reasons to make the selection model more learnable.

Longer-term (months 3–6)

Re-fit the composite weights against Applovin scale labels once ≥50 $10K+ winners exist.
Pull pre-October-2025 Facebook auto-test data. Would fill in the 453-creative blind spot.
Build an Applovin-side companion score based on Applovin's own D1/D7 metrics.

13 · Appendix: File & Reproduction Index

Master tables (canonical)

File	Purpose	n_rows
`MASTER_creative_activity_table.csv`	All creatives with activity flags	1,426
`RELIABLE_cohort.csv`	Strict cohort: active + D7-mature + non-benchmark	669
`RELIABLE_cohort_with_outcomes.csv`	RELIABLE + 4 outcome flags + monthly spend	669
`RELIABLE_cohort_with_install_share.csv`	v2 main: RELIABLE + outcomes + install_share + composite features	669
`composite_search.csv`	Exhaustive 1/2/3/4/5-feature combination search (v1)	—

Cleaning & joins

File	Purpose
`test_to_applovin_scale_merged_v2.csv`	Primary merged table (n=772) with benchmark-aware aggregation
`scale_monthly_panel.csv`	Per-creative monthly spend matrix (Feb–May)

Models & coefficients

File	Purpose
`univariate_auc_scaled1k.csv`	Single-feature AUC rankings (earlier iteration)
`two_stage_coefs_*.csv`	Stage 1/2A/2B + single-stage coefficients

Crosstabs & winner inspection

File	Purpose
`crosstab_d7_quartile_x_tier.csv`	D7 × scale tier chi-square
`crosstab_d0_quartile_x_tier.csv`	D0 × scale tier chi-square
`winners_inspection_data.csv`	17 ≥$10K winners with all raw metrics

Test → Applovin Scale Analysis v3.2 · Andres Mendoza · Generated 2026-05-12
Reproducible from RELIABLE_cohort_with_install_share.csv in analysis_outputs/