Andres Mendoza · Growth Data Analyst (Central Hybrid Team) · Analyst Edition
gameplay_narrative_arc: Fail → Win (+24.5%, n=117),
hook: Problem Statement (+20.2%, n=81),
strategic_angle: Cognitive Challenge (+18.3%, n=71). Everything else is noise or worse.
Creative performance is not random. Specific tag values consistently lift Lumina, and we can name them. The system uses these patterns to inform briefs; it is a guidance tool, not a forecasting model. Tags explain ~3% of total variance, so a recommendation is "best evidence we have," not a prediction.
| Level | Question it answers | Anchor outputs |
|---|---|---|
| Descriptive | What happened? | creative_level_analysis, insights_tag_*, tag_metric_correlations, weekly_lumina_d7_trajectory |
| Diagnostic | Why did it happen? | concept_level_analysis, variance_breakdown_summary, tag_execution_difficulty, concept_dna_ranking, dna_difficulty_matrix |
| Prescriptive | What should we do? | tag_swap_recommendations, metric_drivers, creative_decision_facts, lift_cube, suggest.py |
| Strategic | Where should we invest? | production_concentration, underexplored_combinations, weekly_lumina_d7_trajectory, producer_difficulty_analysis |
Lumina is a composite score. D0 measures same-day performance (≤24h after the creative ran); D7 measures mature-cohort performance (≥7 days). Each horizon uses metrics from its own time window — D7 does not borrow D1 retention, because doing so would double-count retention and reward same-day engagement at the D7 layer.
Where:
roas_d7 + retention_d7 are both non-null.
Verification: the persisted lumina_score_d7 in creative_level_analysis.csv matches an independent recompute to 0.0000 absolute difference across all 916 D7-mature rows.
C444_V4_WW) in one test window.
C444, R1617, etc. We have 998 creatives across 236 concepts, about 4 versions per concept on average.
Source: variance_breakdown_summary.csv + variance_decomposition.json (regenerated against corrected 4-component D7 formula; total D7 variance = 528.26). Eta-squared via ANOVA, grouping per-creative D7 scores by orion_concept (n = 789 D7-mature creative-window rows, 236 concepts).
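To make the eta-squared figure concrete, here is a minimal standard-library sketch of the one-way ANOVA share of variance (between-group sum of squares over total sum of squares). The function name and the toy scores are illustrative assumptions, not the pipeline's code or data.

```python
from collections import defaultdict
from statistics import fmean


def eta_squared(scores, groups):
    """Share of total variance explained by group membership: SS_between / SS_total."""
    grand_mean = fmean(scores)
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    ss_between = sum(
        len(vals) * (fmean(vals) - grand_mean) ** 2 for vals in by_group.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical toy data: two concepts, two versions each.
scores = [60.0, 62.0, 40.0, 42.0]
concepts = ["A", "A", "B", "B"]
eta = eta_squared(scores, concepts)
print(f"eta^2 = {eta:.4f}")
```

In the toy data almost all of the spread sits between the two concepts, so eta-squared is close to 1; the report's 40.5% is the same ratio computed with orion_concept as the grouping variable.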
Layer 1 — The concept (40.5% of variance). The biggest single thing predicting whether a creative wins is which idea it belongs to. The framework treats this as the strategic lever: scale concepts that consistently win, kill concepts that consistently lose. This is the Creative Director's call, not the producer's.
Layer 2 — Within-concept differences (~59% combined). Once you fix the concept, what separates V1 from V5? About 3% from tag choice, 2% from producer, the rest is execution (the part the data can't see — script timing, edit rhythm, music drop, etc.). Tags don't dominate this layer because versions of the same concept usually share the same tags on purpose. That's how iteration works.
Layer 3: Tag patterns across the portfolio (3.3%). When we look across concepts, certain specific tag values systematically lift Lumina (the 3 Reliable Winners; see §5.3). 3.3% sounds small, but reading it as "tags don't matter" is misleading: most of the variance sits at the concept layer or in unmeasured execution, neither of which a producer can set directly. Tags are the one lever a producer can directly turn, so that is where the producer's decision lives.
Layer 4 — Why the 4 DNA tags are special. Of the 9 tag categories, 4 carry concept identity (gameplay_narrative_arc, hook, strategic_angle, creative_focus). Changing one of them is essentially trying a different concept. The other 5 (cta_strategy, gameplay_representation, pacing_energy, audio_narration, presenter_layer) vary execution without changing what the concept is. This split is what makes the engine work: hold DNA constant, vary peripherals → controlled experiments. Change DNA → cross-concept exploration.
Bottom line. Concept layer answers "which ideas to test." DNA-tag layer answers "what should the next idea look like." Peripheral-tag layer answers "how do we iterate safely on a working idea." The engine operates at all three. Treating any one as "the answer" misses the structure.
A concept's average Lumina tells you whether the idea works on average. The within-concept standard deviation tells you whether iteration on its versions pays off — high std means the best version beats the worst by a lot, so testing more versions captures real upside. This is the explore-vs-exploit signal at the concept level. Each dot below is one concept (≥3 versions); marker size = number of versions.
| Concept class | Definition | Operational read |
|---|---|---|
| CONSISTENT WINNER | High avg Lumina D7, low within-concept std. Most versions land above the median; the worst version isn't far below the best. | Scale this concept (more spend, more variants in production). Reliable but not teaching us anything new — versions are too similar to extract iteration learnings. |
| HIGH-CEILING BET | High avg Lumina D7, high within-concept std. Best version is much better than worst — iteration captures real upside. | Keep producing more versions; the marginal version has high expected value because the spread between V1 and V_best is wide. |
| AVERAGE PERFORMER | Mid avg, mid std. Neither obviously winning nor failing. | Hold; revisit if data sharpens. Not the best place to invest the next iteration cycle. |
| NEEDS ANALYSIS | Mixed signals; classification rules don't put it cleanly in any other bucket. | Manual review — check whether the concept is genuinely ambiguous or whether it has insufficient data (≥3 versions but unstable scoring). |
| LOTTERY EFFECT | Low avg Lumina D7 but high within-concept std. Most versions underperform; one or two outliers carry the average. | Don't double down on this concept — the wins are random, not replicable. Investigate whether the outlier had a unique execution that's worth porting elsewhere. |
| STABLE UNDERPERFORMER | Low avg Lumina D7, low within-concept std. Reliably bad. Versions cluster around a low ceiling. | Kill candidate. No iteration upside, no outlier hope. Reallocate the production budget. |
Note: single-version and benchmark concepts get separate labels (Single Version Low/Average/High; BENCHMARK – Stable High / Variable High / Variable Average / Underperforming) and are excluded from the chart below because std is undefined or non-comparable.
Source: concept_level_analysis.csv (filtered to concepts with ≥3 versions, benchmarks excluded). Quadrant reading: top-right = HIGH-CEILING BETS; top-left = LOTTERY EFFECTS; bottom-right = CONSISTENT WINNERS; bottom-left = STABLE UNDERPERFORMERS. Median Lumina and median std reference lines split the quadrants.
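The quadrant reading can be sketched as a median split on the two axes. This is a simplified illustration only: it covers the four corner classes and ignores AVERAGE PERFORMER and NEEDS ANALYSIS, whose exact rules are not stated here. The concept names and numbers are hypothetical.

```python
from statistics import median


def classify_quadrants(concepts):
    """concepts: dict of name -> (avg_lumina_d7, within_concept_std)."""
    med_avg = median(avg for avg, _ in concepts.values())
    med_std = median(std for _, std in concepts.values())
    labels = {}
    for name, (avg, std) in concepts.items():
        high_avg, high_std = avg > med_avg, std > med_std
        if high_avg and high_std:
            labels[name] = "HIGH-CEILING BET"
        elif high_avg:
            labels[name] = "CONSISTENT WINNER"
        elif high_std:
            labels[name] = "LOTTERY EFFECT"
        else:
            labels[name] = "STABLE UNDERPERFORMER"
    return labels


# Hypothetical concepts, not rows from concept_level_analysis.csv.
toy = {"X1": (70.0, 5.0), "X2": (68.0, 15.0), "X3": (30.0, 14.0), "X4": (28.0, 4.0)}
labels = classify_quadrants(toy)
print(labels)
```

The median reference lines play the role of the chart's quadrant dividers; a production classifier would also need a middle band and a sample-size check.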
Mutual information between each tag category and orion_concept, computed on de-duplicated unique creatives. Higher MI = the tag does more work distinguishing one concept from another. The top 4 tags cover 61% of the total — these are the concept's structural DNA. The other 5 are peripheral execution choices: cosmetics that vary across versions of the same concept without changing what the concept fundamentally is.
Source: concept_dna_ranking.csv built via sklearn.feature_selection.mutual_info_classif
on 998 de-duplicated creatives. Concept entropy = 7.54 bits.
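For intuition about the ranking, the sketch below computes the plug-in mutual information between two categorical columns from raw counts, in bits. It is a from-scratch illustration, not the sklearn call named above (which, per its documentation, reports nats), and the toy tag and concept values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information_bits(xs, ys):
    """Plug-in MI estimate between two equal-length categorical sequences, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


concepts = ["C1", "C1", "C2", "C2"]
# A tag that perfectly separates the two concepts carries 1 bit about them.
mi_perfect = mutual_information_bits(["a", "a", "b", "b"], concepts)
# A tag that is independent of concept carries 0 bits.
mi_independent = mutual_information_bits(["a", "b", "a", "b"], concepts)
print(mi_perfect, mi_independent)
```

A DNA tag behaves like the first case (knowing the tag value narrows down the concept), while a peripheral tag sits closer to the second.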
Each subsequent section is one of the 4 levels. Within each section, charts come from the CSVs listed in the appendix. Every recommendation has an evidence tier — see Section 9 for definitions.
The dataset spans 998 unique creatives across 236 concepts, 22 producers, and 24 test windows. Of these, 916 have a computed Lumina D7 score.
Source: weekly_lumina_d7_trajectory.csv. Portfolio Lumina D7 hovers in the ~37–41 range across the period, with no clear improvement trend over 23 windows, consistent with the "net learning ≈ zero" finding.
For each tag category, six side-by-side views in two rows. Row 1: (1) share of production — how often each value is used (currently equivalent to share of spend, since auto-test creatives receive equal spend); (2) average IPM — installs per 1000 impressions; (3) average CPI — cost per install (lower is better). Row 2: (4) average ROAS D7 — return on ad spend at day 7; (5) average Retention D7; (6) average Lumina D7 — the composite north-star. Bars colored green or blue when the value is in the favorable direction vs the production-weighted overall average; grey otherwise. Dotted line = overall average. CPI is colored inversely (lower is better).
DNA tags carry concept identity; peripheral tags are execution choices. Read the DNA tag charts as "if you change this, you're effectively testing a different concept"; read the peripheral tag charts as "safer within-concept tweaks."
Every concept is classified into one of the categories below based on its mean Lumina, variance, sample size, and ceiling. This is the explore-vs-exploit map at the concept level.
Source: concept_level_analysis.csv column concept_class.
15 CONSISTENT WINNERS to scale, 19 HIGH-CEILING BETS to keep iterating,
42 STABLE UNDERPERFORMERS to kill.
The bar chart above shows class size; the boxplot below shows the actual distribution of per-creative Lumina D7 within each class. Read the spread (whiskers and outliers) to see how distinct the classes really are — a CONSISTENT WINNER's median ought to sit clearly above an AVERAGE PERFORMER's, and a LOTTERY EFFECT should have a wider IQR than a STABLE UNDERPERFORMER.
Source: creative_decision_facts.csv. Benchmark and single-version concepts excluded.
Box = IQR, vertical line = median, dashed line = mean.
| Layer | Variance explained (D7) | R² | Significance | Confounding |
|---|---|---|---|---|
| Concept (from variance_decomp) | 40.54% | 0.4054 | p<0.001 | Low |
| Producer (from variance_decomp) | 2.03% | 0.0203 | p<0.01 | High (V=0.31-0.62 with tags) |
| Strategic Angle (from variance_decomp) | 1.57% | 0.0157 | p<0.01 | High (V=0.47-0.59 with other tags) |
| gameplay_narrative_arc (individual) | 2.06% | 0.0206 | p<0.01 | Severe (V=0.33-0.55 vs tags, prod=0.47, window=0.45) |
| strategic_angle (individual) | 0.96% | 0.0096 | p<0.05 | Severe (V=0.39-0.59 vs tags, prod=0.51, window=0.29) |
| gameplay_representation (individual) | 0.51% | 0.0051 | p<0.01 | Severe (V=0.22-0.64 vs tags, prod=0.45, window=0.25) |
| hook (individual) | 0.47% | 0.0047 | p<0.01 | High (V=0.18-0.41 vs tags, prod=0.30, window=0.35) |
| audio_narration (individual) | 0.12% | 0.0012 | p<0.05 | Severe (V=0.30-0.62 vs tags, prod=0.62, window=0.30) |
| presenter_layer (individual) | 0.10% | 0.0010 | p<0.10 | Severe (V=0.27-0.67 vs tags, prod=0.60, window=0.32) |
| pacing_energy (individual) | 0.08% | 0.0008 | p<0.05 | Severe (V=0.18-0.62 vs tags, prod=0.54, window=0.25) |
| cta_strategy (individual) | 0.04% | 0.0004 | p=0.459 | Moderate (V=0.31-0.39 vs tags, prod=0.38, window=0.38) |
| creative_focus (individual) | 0.03% | 0.0003 | p<0.01 | Severe (V=0.29-0.67 vs tags, prod=0.49, window=0.29) |
| All 9 tags combined | 3.29% | 0.0329 | p<0.01 | Massive multicollinearity |
| Top 2 tags combined | 2.30% | 0.0230 | p<0.01 | High redundancy |
| Top 4 tags combined | 2.62% | 0.0262 | p<0.01 | Massive redundancy |
| Unexplained variance | 56.17% | n/a | n/a | n/a |
Source: variance_breakdown_summary.csv. The "Concept" row groups creatives by orion_concept (e.g. C396, R1617) — the creative idea — not by individual creative_id. Confounding via Cramér's V on the categorical pairs.
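As a reference for the Confounding column, here is a minimal sketch of Cramér's V for a pair of categorical columns, built from the chi-squared statistic of their contingency table. It uses the plain (non-bias-corrected) formula; whether the pipeline applies a correction is not stated, so treat this as an assumption. The toy values are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), no bias correction."""
    n = len(xs)
    row, col, obs = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (obs[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


producers = ["P1", "P1", "P2", "P2"]
# Each producer always uses the same tag value: fully confounded.
v_confounded = cramers_v(["a", "a", "b", "b"], producers)
# Tag value is unrelated to producer: no association.
v_unrelated = cramers_v(["a", "b", "a", "b"], producers)
print(v_confounded, v_unrelated)
```

Values such as V=0.62 between a tag and producer mean the two are hard to separate: a lift attributed to the tag may partly be the producer's.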
Lumina D7 lift: tag_metric_correlations.lift_pct filtered to metric=lumina_score_d7. Caveat: not concept-controlled, so some lift may be concept confounding rather than the tag's own effect.
Success rate: tag_execution_difficulty.success_rate_d7. A tag with success_rate=0.70 means 70% of the creatives that used it were classified as successful; 30% under-performed.
Difficulty: derived in tag_execution_difficulty.csv from within-tag std + range spread + sample-size adjustment. Easy/Moderate = the tag delivers consistently; Hard/Very Hard = wide variance, success depends heavily on execution quality. Lower difficulty does NOT mean better; it means more predictable.
| Tag | Lumina D7 lift | Success rate | Difficulty | Sample | Significance |
|---|---|---|---|---|---|
| gameplay_narrative_arc: Fail → Win | +24.5% | 73% | Easy | n=117 | Highly Significant (p<0.01) |
| hook: Problem Statement | +20.2% | 64% | Easy | n=81 | Highly Significant (p<0.01) |
| strategic_angle: Cognitive Challenge | +18.3% | 62% | Easy | n=71 | Highly Significant (p<0.01) |
Source: dna_difficulty_matrix.csv joining tag_execution_difficulty.csv,
tag_metric_correlations.csv, and producer_difficulty_analysis.csv.
Each D7-scored creative has a "weakest component" — the term in the Lumina D7 formula that contributed most negatively to its score. Distribution across 916 D7-scored creatives (4-component formula: IPM, ROAS_D7, RET_D7, −CPI):
ROAS_D7 drags 311 creatives (34%)
Retention_D7 drags 272 (30%)
IPM drags 203 (22%)
CPI drags 130 (14%)
Composition split by concept class:
For each tag value (rows, n≥30 only), the lift % vs creatives without the tag, across all 8 KPIs (columns). Cells are red→green, centered at 0%; cells annotated with the lift value. ** = p<0.05, * = p<0.10. Read across a row to see whether a tag lifts everything or only one specific KPI; read down a column to see which tags are the strongest levers for a given KPI.
Source: tag_metric_correlations.csv filtered to sample_size ≥ 30.
Note: lifts are univariate (one tag at a time, no concept fixed effects) so they double-count
correlation between confounded tags. Use this view for hypothesis generation, not causal claims.
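The univariate lift in each heatmap cell can be sketched as the percentage gap between the mean metric of creatives with the tag value and the mean of those without it. The row layout (a list of dicts) and the toy scores are assumptions for illustration; the significance test is omitted.

```python
from statistics import fmean


def lift_pct(rows, tag_category, tag_value, metric="lumina_score_d7"):
    """Percent difference in mean metric: creatives with the tag vs. those without."""
    with_tag = [r[metric] for r in rows if r[tag_category] == tag_value]
    without = [r[metric] for r in rows if r[tag_category] != tag_value]
    if not with_tag or not without:
        return None  # nothing to compare against
    base = fmean(without)
    if base == 0:
        return None  # relative lift is undefined for a zero baseline
    return (fmean(with_tag) - base) / base * 100.0


# Hypothetical scores, not values from tag_metric_correlations.csv.
rows = [
    {"hook": "Problem Statement", "lumina_score_d7": 60.0},
    {"hook": "Problem Statement", "lumina_score_d7": 60.0},
    {"hook": "None (No Hook)", "lumina_score_d7": 50.0},
    {"hook": "Surprise / Unexpected", "lumina_score_d7": 50.0},
]
lift = lift_pct(rows, "hook", "Problem Statement")
print(f"{lift:+.1f}%")
```

Because nothing here holds concept fixed, a tag that happens to be concentrated in strong concepts inherits their advantage, which is the confounding the note above warns about.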
This is the core diagnostic mechanism. Given an underperforming creative:
1. Find its weakest component (creative_decision_facts.csv, weakest_component_d7 column).
2. Look up that KPI in lift_cube.csv → ranked tag values that lift it.
3. Keep rows with evidence_tier ∈ {HIGH_CONFIDENCE, MEDIUM_CONFIDENCE}.
4. Check producer_tag_performance for the best executor.
5. Consult producer_difficulty_analysis.
6. Check tag_swap_recommendations.csv for a within-concept evidence-backed swap (EXPLOIT path).
| Driver tag | Lift % | Sample | p-value | Significance |
|---|---|---|---|---|
| gameplay_narrative_arc: Fail → Win | +24.5% | 117 | 0.0000 | Highly Significant (p<0.01) |
| hook: Problem Statement | +20.2% | 81 | 0.0011 | Highly Significant (p<0.01) |
| strategic_angle: Cognitive Challenge | +18.3% | 71 | 0.0076 | Highly Significant (p<0.01) |
| gameplay_representation: Core Loop / Real-gameplay | +16.5% | 28 | 0.0742 | Marginally Significant (p<0.10) |
| hook: Immediate Action | +14.4% | 14 | 0.1947 | Not Significant (p≥0.10) |
Source: metric_drivers.csv filtered to metric=lumina_score_d7.
| Concept | Tag category | Current | Recommended | +D7 Lumina | P(positive) | n versions |
|---|---|---|---|---|---|---|
| E34 | hook | Problem Statement | Social Proof | +65.5 | 100% | 2 |
| C414 | strategic_angle | Progress & Achievement | Relax & Escape | +47.5 | 100% | 3 |
| C519 | gameplay_narrative_arc | Fail → Win | Levels mix | +46.0 | 100% | 7 |
| R1617 | hook | None (No Hook) | Surprise / Unexpected | +39.1 | 98% | 6 |
| E44 | pacing_energy | Rising Tension | High-Energy / Fast | +37.4 | 100% | 6 |
| E44 | gameplay_representation | Exaggerated / Aspirational gameplay | Conceptual | +37.4 | 100% | 6 |
| C450 | gameplay_narrative_arc | Order → Chaos | Fail → Win | +35.1 | 98% | 12 |
| C450 | cta_strategy | None (No CTA) | End-Card Only | +35.1 | 98% | 12 |
Source: tag_swap_recommendations.csv filtered to high-confidence rows.
Note: only 31 of 169 concepts have a high-confidence swap available — the rest don't have enough version diversity for the swap algorithm to fire.
Run suggest.py --creative <id> --window <id> on any creative and you get a full diagnostic. Here's the output for an actual underperformer in concept C504 (a CONSISTENT WINNER concept — meaning C504 versions normally do well, but this specific one didn't):
==============================================================================
CREATIVE: cmmon4dxm09ay0cpnm09lsu4b | Window: 19 | Producer: Jeremy Laplatine
Concept: C504 | Class: CONSISTENT WINNER
D7 Avg: 68.6 (p95), D7 Max: 78.4, D7 Std: 8.3, Versions: 5
------------------------------------------------------------------------------
Lumina D7: 57.4 (concept avg 68.6, -1.4σ vs concept peers)
DECOMPOSITION (sorted by drag, most negative first):
Retention_D7 -0.95 ← WEAKEST
IPM +0.42
ROAS_D7 +0.47
CPI (lower better) +1.06
EXPLOIT — within-concept tag swaps (n=0):
[no high-confidence swaps available for this concept]
EXPLORE — cross-concept lifts on weakest KPI (ret_d7, n=4):
• gameplay_narrative_arc: try 'Order → Chaos'
+28.6% on retention_d7 (p=0.000, n=524, HIGH_CONFIDENCE) [⚠ DNA tag]
Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
→ Best on this tag: Julie Droz (+19.0% D7 lift on this tag, n=45)
→ Best on this tag: Ayca Uyanik (+11.7% D7 lift on this tag, n=17)
• strategic_angle: try 'Relax & Escape'
+11.4% on retention_d7 (p=0.000, n=570, HIGH_CONFIDENCE) [⚠ DNA tag]
Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
→ Best on this tag: Julie Droz (+15.6% D7 lift on this tag, n=51)
→ Best on this tag: Nikola Kachanski (+2.7% D7 lift on this tag, n=231)
• hook: try 'Surprise / Unexpected'
+2.8% on retention_d7 (p=0.019, n=284, HIGH_CONFIDENCE) [⚠ DNA tag]
Difficulty: Moderate | Success rate: 47%
→ Best on this tag: Vira Bilous (+13.2% D7 lift on this tag, n=18)
→ Best on this tag: Nikola Kachanski (+2.5% D7 lift on this tag, n=104)
• hook: try 'None (No Hook)'
+1.7% on retention_d7 (p=0.032, n=525, HIGH_CONFIDENCE) [⚠ DNA tag]
Difficulty: Hard | Success rate: 51% ← HARD, route to specialist
→ Best on this tag: Nikola Kachanski (+5.2% D7 lift on this tag, n=144)
→ Best on this tag: Yevhenii Hrushetskyi (+2.6% D7 lift on this tag, n=34)
==============================================================================
§6.4 is the engine analyzing a single creative. The biweekly deliverable works at a different grain — a concept brief, not a per-creative fix. Briefs ship for new concepts and new versions; nobody re-makes a creative that already shipped.
The bridge: the same logic (decomposition, EXPLOIT/EXPLORE, producer evidence) gets rolled up across the portfolio. For every concept the engine answers three questions:
Apply this to all D7-mature concepts and you get the cycle deliverable: Brief_Backlog_v1.html. One concept block looks like:
─────────────────────────────────────────────────────────────────────────────
C428 [SCALE] CONSISTENT WINNER
5 versions shipped · avg D7 = 53.7 · std = 12.9
─────────────────────────────────────────────────────────────────────────────
Rationale:
C428 is a CONSISTENT WINNER. Goal: hold the proven DNA and systematically
vary ONE peripheral tag per new version. The 3 versions below test the 3
peripheral swaps with positive cross-concept evidence — generating data the
concept currently lacks.
DNA (held constant):
strategic_angle: Relax & Escape
hook: None (No Hook)
creative_focus: Gameplay Mechanic
gameplay_narrative_arc: Order → Chaos
CONCEPT-LEVEL PRODUCER ROUTING:
EXPLOIT Producer-X +N% weighted (n on DNA tags, K/4 covered)
EXPLORE Producer-Y n=K on DNA tags (build capability)
Versions (3):
V1 — None (Silent) audio H-2026-05-05-001
Primary: audio_narration: Music/SFX → None (Silent)
Expected D7 lift: +10.8% [HIGH_CONFIDENCE, n=107]
V2 — Core Loop / Real-gameplay representation H-2026-05-05-002
Primary: gameplay_representation: Exaggerated → Core Loop / Real-gameplay
Expected D7 lift: +6.6% [MEDIUM_CONFIDENCE, n=28]
V3 — Always-On Banner CTA H-2026-05-05-003
Primary: cta_strategy: End-Card Only → Always-On Banner
Expected D7 lift: +2.3% [MEDIUM_CONFIDENCE, n=46]
─────────────────────────────────────────────────────────────────────────────
suggest.py is available for ad-hoc analysis on any specific creative when needed.
Full operational deliverable: outputs_3/Brief_Backlog_v1.html — 13 active concepts (5 SCALE + 5 ITERATE + 3 EXPLORE) + 3 KILL recommendations, ~40 hypotheses total. The concept-level routing block aggregates each producer's track record across the concept's 4 DNA tags; the version-level prescriptions are the same defensible peripheral swaps applied across all SCALE/ITERATE concepts (this structural repetition is the data-generating mechanism for cycle-over-cycle learning).
outputs_3/build_brief_backlog.py.
| Step | What it does | Source / rule |
|---|---|---|
| 1. Concept selection: which concepts get a brief this cycle? | SCALE = top N by avg D7 in CONSISTENT WINNER class. ITERATE = top N in HIGH-CEILING BET. EXPLORE = top N rows in underexplored_combinations.csv with is_underexplored=True. KILL = bottom N in STABLE UNDERPERFORMER. | Source: concept_level_analysis.csv + underexplored_combinations.csv. Selection is fully dynamic: no hardcoded lists. Every concept must clear observation_count ≥ MIN_VERSIONS to qualify (excludes single-version concepts where std is undefined). |
| 2. Version prescription: what does each new version test? | Hold the concept's 4 DNA tags constant. Generate one version per defensible peripheral swap. ITERATE concepts also get a 4th version if there's a high-confidence within-concept DNA swap available. | Source: tag_metric_correlations.csv (peripheral filter) + tag_swap_recommendations.csv (DNA swap V4). Defensible swap rule: lift > 0 AND p-value ≤ P_VALUE_MAX AND n ≥ MIN_N_DEFENSIBLE_SWAP. Negative-expected-lift swaps are explicitly excluded. |
| 3. Producer routing: who's the best motion designer for this brief? | For each concept, aggregate each producer's D7 lift on the concept's 4 DNA tags, weighted by their reps on each tag. EXPLOIT = top 2 by weighted lift. EXPLORE = bottom 2 (capability-building). | Source: producer_tag_performance.csv. Producer must have ≥ MIN_PRODUCER_REPS on at least one of the concept's DNA tags to enter the ranking. Per-version Hard-tag flag fires from dna_difficulty_matrix.csv only when the version's tag is Hard or Very Hard. |
| 4. Hypothesis ID: how does each recommendation get tracked? | Each version recommendation receives a unique sequential ID: H-YYYY-MM-DD-NNN. The Director's brief references the ID, the Creative Producer logs the implemented concept name in the tracker, and the next pipeline cycle's loop-closure script computes predicted-vs-actual. | Bookkeeping. The tracker schema is the persistent ledger; the cycle_log.csv records each cycle's inputs (data hash) and outputs (artifacts). |
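The producer-routing rule in step 3 can be sketched as a reps-weighted average of each producer's lift over the concept's DNA tag values, followed by a top-2 / bottom-2 split. The record layout, producer labels, and lift numbers below are hypothetical, and only two DNA tag values are used for brevity; this is not the code in build_brief_backlog.py.

```python
MIN_PRODUCER_REPS = 5  # knob value quoted in the report


def rank_producers(records, dna_tags, min_reps=MIN_PRODUCER_REPS):
    """records: (producer, tag_value, d7_lift_pct, reps) tuples (assumed layout)."""
    acc = {}
    for producer, tag, lift, reps in records:
        if tag not in dna_tags or reps <= 0:
            continue  # only the concept's own DNA tags count
        a = acc.setdefault(producer, {"weighted": 0.0, "reps": 0, "max_reps": 0})
        a["weighted"] += lift * reps
        a["reps"] += reps
        a["max_reps"] = max(a["max_reps"], reps)
    # Eligibility: enough reps on at least one of the concept's DNA tags.
    ranked = [
        (p, a["weighted"] / a["reps"])
        for p, a in acc.items()
        if a["max_reps"] >= min_reps
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


records = [
    ("P1", "Problem Statement", 20.0, 10),
    ("P1", "Cognitive Challenge", 10.0, 10),
    ("P1", "Relax & Escape", 50.0, 40),  # not a DNA tag of this concept: ignored
    ("P2", "Problem Statement", 5.0, 20),
    ("P3", "Cognitive Challenge", -4.0, 8),
    ("P4", "Problem Statement", 8.0, 6),
    ("P5", "Cognitive Challenge", 30.0, 2),  # below MIN_PRODUCER_REPS: excluded
]
ranking = rank_producers(records, {"Problem Statement", "Cognitive Challenge"})
exploit = [p for p, _ in ranking[:2]]
explore = [p for p, _ in ranking[-2:]]
print(exploit, explore)
```

With fewer than four eligible producers the two slices overlap, so a real implementation would need a rule for that case.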
Every threshold that controls the backlog lives at the top of build_brief_backlog.py. Change a number, regenerate, ship. These are not statistical thresholds — they're operational levers the team can adjust as capacity or evidence appetite changes.
| Knob | Current value | What it controls |
|---|---|---|
| N_SCALE | 5 | How many CONSISTENT WINNER concepts get SCALE briefs each cycle |
| N_ITERATE | 5 | How many HIGH-CEILING BETS get ITERATE briefs |
| N_EXPLORE | 3 | How many under-explored DNA recipes get tested |
| N_KILL | 3 | How many STABLE UNDERPERFORMERs get deprioritized |
| MIN_VERSIONS | 3 | Minimum versions a concept must have to qualify (excludes singleton concepts) |
| MIN_RECIPE_MATURE | 2 | Minimum D7-mature creatives a recipe must have to be EXPLORE-eligible |
| P_VALUE_MAX | 0.10 | Max p-value for a peripheral swap to be "defensible" |
| MIN_N_DEFENSIBLE_SWAP | 15 | Min sample size for a peripheral swap |
| MIN_LIFT_DEFENSIBLE_SWAP | 0 | Min Lumina D7 lift % (>0 = positive only) |
| MIN_PRODUCER_REPS | 5 | Min reps a producer needs on a DNA tag to enter ranking |
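As an illustration of how three of these knobs combine, the defensible-swap rule (lift > 0 AND p-value ≤ P_VALUE_MAX AND n ≥ MIN_N_DEFENSIBLE_SWAP) can be written as a single predicate. The function name is an assumption; the two example rows borrow numbers from the driver table purely to exercise the thresholds.

```python
# Knob values as listed in the table above.
P_VALUE_MAX = 0.10
MIN_N_DEFENSIBLE_SWAP = 15
MIN_LIFT_DEFENSIBLE_SWAP = 0


def is_defensible_swap(lift_pct, p_value, n):
    """True when a swap clears all three thresholds; missing inputs never qualify."""
    if lift_pct is None or p_value is None or n is None:
        return False
    return (
        lift_pct > MIN_LIFT_DEFENSIBLE_SWAP
        and p_value <= P_VALUE_MAX
        and n >= MIN_N_DEFENSIBLE_SWAP
    )


core_loop_ok = is_defensible_swap(16.5, 0.0742, 28)         # clears all three
immediate_action_ok = is_defensible_swap(14.4, 0.1947, 14)  # fails on p and on n
print(core_loop_ok, immediate_action_ok)
```

Loosening P_VALUE_MAX or MIN_N_DEFENSIBLE_SWAP widens the backlog at the cost of weaker evidence per hypothesis, which is the trade-off these knobs expose.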
Two known weaknesses worth stating up front:
Every cycle's run prints exact selections to stdout: concepts picked, defensible swaps qualified, CV quantiles used. That's the audit trail.
H-2026-05-05-001, etc.) that the Creative Producer references when shipping the brief. The Producer marks in the tracker which target concept name was assigned to each hypothesis. Two cycles later, when that concept's D7 data has matured, the engine automatically computes predicted-vs-actual and tags the hypothesis as HIT, MISS, or MIXED.
recommendation_outcomes.csv from §9 and graduates the engine from "best-evidence guidance" toward "evidence-validated guidance." This is the system we are explicitly building toward.
For each tag value, the chart shows how its mean Lumina D7 differs from the overall D7 average across the 825 D7-eligible non-benchmark creatives. Red = OVER-INVESTED (high volume + below-average D7 performance — these are the tags consuming most of your production capacity but pulling your average down; the rebalance candidates). Green = WELL-INVESTED. Orange = UNDER-INVESTED (low volume + above-average performance — scale candidates).
strategic_angle: Relax & Escape at 67% of production with -1.2% lift,
hook: Surprise / Unexpected at 34% with -2.2%,
gameplay_narrative_arc: Order → Chaos at 62% with -2.2%, and so on).
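Meanwhile the 3 most under-invested tags (Cognitive Challenge, Problem Statement, Core Loop / Real-gameplay) overlap with the top Lumina D7 drivers: the first two are DNA Reliable Winners identified in §5.3, and Core Loop / Real-gameplay is a peripheral gameplay_representation value with a marginally significant +16.5% lift (n=28).
This is the clearest reallocation signal in the report.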
Source: outputs_3/production_concentration_d7.csv — the Lumina D7 version, with benchmark creatives and "No Tag Applied" rows excluded (benchmarks recur across windows and would inflate volume artificially). Built by outputs_3/build_production_concentration_d7.py. NEUTRAL-status tags (medium-volume) are hidden in this chart for clarity; see the CSV for the full list.
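The colour coding can be approximated by crossing a volume band with the sign of the lift. The share cut-offs below (30% and 10%) and the reading of WELL-INVESTED as "high volume with non-negative lift" are assumptions made for this sketch; the actual thresholds live in build_production_concentration_d7.py and are not stated here.

```python
def investment_status(share, lift_pct, high_share=0.30, low_share=0.10):
    """share: fraction of production; lift_pct: % gap vs the overall D7 average.

    high_share and low_share are hypothetical cut-offs, not the pipeline's values.
    """
    if share >= high_share:
        return "OVER-INVESTED" if lift_pct < 0 else "WELL-INVESTED"
    if share <= low_share and lift_pct > 0:
        return "UNDER-INVESTED"
    return "NEUTRAL"


# Shares and lifts for the first two calls are the ones quoted in the text;
# the 5% share in the third call is a made-up value for illustration.
relax_escape = investment_status(0.67, -1.2)
surprise_hook = investment_status(0.34, -2.2)
rare_winner = investment_status(0.05, 18.3)
print(relax_escape, surprise_hook, rare_winner)
```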
9-tag recipes flagged is_underexplored=True by the pipeline (recipe appears far less often than expected) AND with at least one D7-mature creative. Sorted by avg Lumina D7. These are direct EXPLORE candidates.
| Strategic Angle | Hook | Creative Focus | Narrative Arc | n | Avg Lumina D7 | D7 percentile |
|---|---|---|---|---|---|---|
| Relax & Escape | None (No Hook) | Gameplay Mechanic | Order → Chaos | 3 | 73.0 | 100% |
| Destruction & Chaos | Social Proof | Gameplay Mechanic | None (No Arc) | 4 | 66.4 | 99% |
| Relax & Escape | None (No Hook) | Gameplay Mechanic | Order → Chaos | 4 | 66.3 | 97% |
| Explore/Discovery | Surprise / Unexpected | Juicy Effects | Levels mix | 3 | 61.0 | 90% |
| Relax & Escape | Social Proof | Narrative Moment | Steady Win | 3 | 59.2 | 88% |
| Progress & Achievement | Problem Statement | Juicy Effects | Fail → Win | 3 | 57.2 | 86% |
| Progress & Achievement | Surprise / Unexpected | Gameplay Mechanic | Fail → Win | 4 | 56.7 | 83% |
| Progress & Achievement | Unrelated Disruptor | Gameplay Mechanic | Steady Win | 3 | 50.5 | 72% |
Source: underexplored_combinations.csv. Sample sizes are intentionally small (n=3–5) — that's the definition of underexplored.
| Producer | n creatives | Avg Lumina D7 | Classification | p-value vs peers |
|---|---|---|---|---|
| Yevhenii Hrushetskyi | 69 | 48.7 | Above Average | 0.267 |
| Vira Bilous | 25 | 46.7 | Above Average | 0.627 |
| Julie Droz | 62 | 48.5 | Above Average | 0.431 |
| Alexandre de Crozals | 13 | 47.1 | Above Average | 0.629 |
| Nikola Kachanski | 348 | 46.2 | Average | 0.224 |
| Jeremy Laplatine | 224 | 42.4 | Average | 0.071 |
| Ayca Uyanik | 24 | 41.1 | Below Average | 0.308 |
| Ana Narchemashvili | 14 | 37.5 | High Variance / High Ceiling | 0.279 |
Hard Tag Specialists (sorted by lift on hard tags vs peers):
| Producer | Specialization | Avg Lumina D7 (hard tags) | Lift vs peers | Sample |
|---|---|---|---|---|
| Julie Droz | Hard Tag Specialist (limited easy data) | 48.4 | +8.8% | 543 |
| Alexandre de Crozals | Hard Tag Specialist (limited easy data) | 47.5 | +8.7% | 97 |
| Yevhenii Hrushetskyi | Hard Tag Specialist (limited easy data) | 48.5 | +6.7% | 609 |
| Vira Bilous | Insufficient Data | 47.0 | +4.0% | 219 |
| Nikola Kachanski | Insufficient Data | 46.2 | +1.6% | 3105 |
| Jeremy Laplatine | Insufficient Data | 42.5 | -5.9% | 1979 |
| Ayca Uyanik | Insufficient Data | 41.2 | -7.7% | 212 |
| Ana Narchemashvili | Insufficient Data | 32.2 | -23.0% | 94 |
The engine sits on top of three new tables in outputs_3/:
creative_decision_facts.csv: atomic per-(creative, window) table with the 4-component Lumina D7 decomposition (IPM, ROAS_D7, RET_D7, CPI). Joins concept-class. 1,078 rows.
lift_cube.csv: (tag × KPI × producer) lookup. 523 rows × 29 cols. Pre-computed evidence tier (HIGH/MEDIUM/EXPLORATORY/INSUFFICIENT_DATA).
dna_difficulty_matrix.csv: DNA tag values × difficulty × success rate × specialist routing. 27 rows × 19 cols.
Plus the orchestrator: suggest.py. Run as: suggest.py --creative <id> --window <id>
| Tier | Rule | Use for |
|---|---|---|
| HIGH_CONFIDENCE | p < 0.05 AND n ≥ 30 | Direct recommendations, action plans |
| MEDIUM_CONFIDENCE | p < 0.10 AND n ≥ 15 | Hypothesis generation, A/B tests |
| EXPLORATORY | weaker | Discussion, qualitative reads |
| INSUFFICIENT_DATA | missing p-value or n | Skip |
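The tier rules in the table translate directly into a small function. The name evidence_tier and the NaN handling are assumptions of this sketch; the thresholds are the ones in the table, and the example inputs reuse p-values and sample sizes from the driver table.

```python
from math import isnan


def evidence_tier(p_value, n):
    """Map a (p-value, sample size) pair to the report's evidence tier."""
    if p_value is None or n is None or isnan(p_value):
        return "INSUFFICIENT_DATA"
    if p_value < 0.05 and n >= 30:
        return "HIGH_CONFIDENCE"
    if p_value < 0.10 and n >= 15:
        return "MEDIUM_CONFIDENCE"
    return "EXPLORATORY"


tiers = [
    evidence_tier(0.0011, 81),   # hook: Problem Statement
    evidence_tier(0.0742, 28),   # gameplay_representation: Core Loop / Real-gameplay
    evidence_tier(0.1947, 14),   # hook: Immediate Action
    evidence_tier(None, 10),     # missing p-value
]
print(tiers)
```

Note that a very small p-value with n below 30 still lands in MEDIUM_CONFIDENCE at best, so sample size caps the tier regardless of significance.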
recommendation_outcomes.csv is exactly what the hypothesis tracker (§6.6) is built to populate. Each cycle's resolved hypotheses feed the loop.
tag_metric_correlations.csv and marginal_effects.csv do NOT control for concept. With concept explaining 40.5% of variance, this matters.
marginal_effects.csv) is 0.13. This is a guidance system, not a prediction model.
| Analysis | n | Caveat |
|---|---|---|
| Variance decomposition | 789–916 D7 rows | Healthy — well-powered. |
| tag_metric_correlations | varies by tag (some <15) | Use evidence_tier to filter. |
| tag_swap_recommendations | 2–10 versions per swap | Most CIs span zero; filter is_high_confidence=1. |
| Producer × tag lifts | often n=10–25 | Only 8 producers total; peer baseline is small. |
| Concept fixed effects | not computed | Major upgrade opportunity. |
| Inheritance learning | 14% ancestor coverage | "Net learning ≈ zero" finding is suspended-judgment. |
"Divergent / Fake Gampeplay" (n=2, typo) and "Divergent / Fake Gameplay" (n=46) are the same value with different labels.
s_created_by_user_role = "Internal" for all 1,078 rows, so no internal/external producer dimension is analyzable.
strategic_angle ↔ gameplay_narrative_arc = 0.55.
tag_metric_correlations.csv and marginal_effects.csv. Will reduce most "tag lift" claims and surface the few that survive; those are the real ones.
Every file used in this report, where it lives, and which level uses it:
| File | Path | Level | Purpose |
|---|---|---|---|
| creative_level_analysis.csv | outputs_test_v2/ | L1, L3 | Atomic creative-window facts. Source of truth for Lumina D7 + components. |
| creative_decision_facts.csv | outputs_3/ | L3 (engine) | Per-creative Lumina decomposition + concept class join. |
| lift_cube.csv | outputs_3/ | L3 (engine) | (tag × KPI × producer) → lift lookup with evidence tier. |
| concept_dna_ranking.csv | outputs_3/ | L2 | Mutual-info ranking of which tag categories define concept identity. |
| dna_difficulty_matrix.csv | outputs_3/ | L2, L3 | DNA tag values × difficulty × success rate × specialist routing. |
| variance_breakdown_summary.csv | outputs_test_v2/tag_combinations/ | L2 | What % of D7 variance each layer explains. |
| variance_decomposition.json | outputs_test_v2/tag_combinations/producer_analytics/ | L2 | ANOVA eta² for concept / producer / strategic_angle. |
| concept_level_analysis.csv | outputs_test_v2/tag_combinations/ | L2 | Concept-class taxonomy (CONSISTENT WINNER, HIGH-CEILING BET...) |
| tag_metric_correlations.csv | outputs_test_v2/tag_combinations/ | L1, L3 | (tag × KPI) lift table with p-values, used for chart bars. |
| metric_drivers.csv | outputs_test_v2/tag_combinations/ | L3 | Top-5 tag drivers per metric (IPM, ROAS, retention...). |
| tag_execution_difficulty.csv | outputs_test_v2/tag_combinations/ | L2 | Per-tag difficulty tier + success rate + ceiling/floor. |
| tag_swap_recommendations.csv | outputs_test_v2/tag_combinations/ | L3 | Within-concept tag swaps with 90% CIs. |
| underexplored_combinations.csv | outputs_test_v2/tag_combinations/ | L4 | 9-tag recipes with low n + high Lumina percentile. |
| production_concentration_d7.csv | outputs_3/ | L4 | Lumina D7 version, benchmarks excluded. Replaces the upstream D0 file. |
| production_concentration.csv | outputs_test_v2/tag_combinations/ | L4 (legacy) | Original D0 version, superseded for §7.1 by the D7 file above. |
| weekly_lumina_d7_trajectory.csv | outputs_test_v2/tag_combinations/ | L1, L4 | Portfolio-wide Lumina trend, WoW change. |
| producer_overview.csv | outputs_test_v2/tag_combinations/producer_analytics/ | L2, L4 | Producer composite_score + p_value vs peers. |
| producer_difficulty_analysis.csv | outputs_test_v2/tag_combinations/producer_analytics/ | L3, L4 | Hard Tag Specialists (Julie, Alexandre). |
| insights_tag_*.csv | outputs_test_v2/ | L1 | Per-tag-category aggregates (avg_ipm, avg_lumina_d7, n). |