Creative Intelligence — Creative Learnings & Production Iteration (tag system)

Andres Mendoza · Growth Data Analyst (Central Hybrid Team) · Analyst Edition

Analytical scope: Facebook Auto-test creatives · North-star metric: Lumina Score D7 · Methodology: 4-level descriptive→strategic framework · Generated: from outputs_test_v2/ + outputs_3/
Audience: data analysts and data scientists. This is the canonical reference report — every claim is anchored to a CSV, every number is reproducible, every limitation is stated. A separate business-facing version will be derived from this.

1 · Executive Summary

998 unique creatives · 916 D7-scored · 236 concepts · 24 test windows · 22 producers · 58 high-confidence suggestions · 4 DNA tag categories · 3 reliable DNA winners

The headline: Concept (the creative idea) explains 40.5% of D7 Lumina variance. All 9 tags combined explain 3.3%. Concept choice is ~12× more predictive than tag choice. The framework treats concept selection as the strategic lever and tags as the operational one — they answer different questions.

Six headline findings

  1. Concept choice dominates. 40.5% of variance comes from the concept (idea). 3.3% from all 9 tags. ~56% is unexplained (execution + noise). Tags are still the only knob a producer directly turns — see §3.2.
  2. 3 specific tag values reliably lift Lumina at p<0.01: gameplay_narrative_arc: Fail → Win (+24.5%, n=117), hook: Problem Statement (+20.2%, n=81), strategic_angle: Cognitive Challenge (+18.3%, n=71). Everything else is noise or worse.
  3. Lumina D7 has 4 components: IPM, ROAS_D7, RET_D7, CPI (inverted). For any underperforming creative we can isolate which component is dragging the score — that's the diagnostic foundation for every recommendation.
  4. 4 of the 9 tag categories define the concept's identity (gameplay_narrative_arc, hook, strategic_angle, creative_focus). They carry 61% of the information distinguishing one concept from another. The other 5 are peripheral — execution choices that vary across versions of the same concept.
  5. Hard Tag Specialists exist: Alexandre de Crozals (+14.0% on hard tags, n=48) and Julie Droz (+10.9%, n=131). For Hard-difficulty tags the engine routes to them.
  6. Iteration is not currently improving creatives. 98 descendants beat their ancestors; 104 underperformed them. Roughly even. Caveat: only 14% of ancestors are in the dataset, so this isn't conclusive.

2 · Business Problem & Approach

Business Problem

Core Thesis

Creative performance is not random. Specific tag values consistently lift Lumina, and we can name them. The system uses these patterns to inform briefs — it's a guidance tool, not a forecasting model. Tags explain ~3% of total variance, so a recommendation is "best evidence we have," not a prediction.

The 4 Layers of Analysis

Level | Question it answers | Anchor outputs
L1 Descriptive | What happened? | creative_level_analysis, insights_tag_*, tag_metric_correlations, weekly_lumina_d7_trajectory
L2 Diagnostic | Why did it happen? | concept_level_analysis, variance_breakdown_summary, tag_execution_difficulty, concept_dna_ranking, dna_difficulty_matrix
L3 Prescriptive | What should we do? | tag_swap_recommendations, metric_drivers, creative_decision_facts, lift_cube, suggest.py
L4 Strategic | Where should we invest? | production_concentration, underexplored_combinations, weekly_lumina_d7_trajectory, producer_difficulty_analysis

3 · Framework & Methodology

3.1 The Lumina Formulas (D0 and D7)

Lumina is a composite score. D0 measures same-day performance (≤24h after the creative ran); D7 measures mature-cohort performance (≥7 days). Each horizon uses metrics from its own time window — D7 does not borrow D1 retention, because doing so would double-count retention and reward same-day engagement at the D7 layer.

raw_score_d0 = 1.2·z(log IPM) + 1.6·z(log ROAS_D0) + 1.0·z(log RET_D1) − 1.0·z(log CPI)
raw_score_d7 = 1.2·z(log IPM) + 1.6·z(log ROAS_D7) + 1.0·z(log RET_D7) − 1.0·z(log CPI)

penalized = raw_score × p_installs × p_impressions
lumina_score = (penalized − min) / (max − min) × 100  (per-window rescale for both D0 and D7 — Lumina compares creatives within the same test cohort)

Where:

Verification: the persisted lumina_score_d7 in creative_level_analysis.csv matches independent recompute to 0.0000 absolute difference across all 916 D7-mature rows.
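
A minimal sketch of what such an independent recompute could look like for one test window, using only the formulas above. The penalty factors p_installs and p_impressions are not defined in this section, so the sketch takes them as precomputed inputs; the use of population standard deviation in the z-score and the handling of a degenerate window are assumptions of this sketch, not confirmed pipeline behavior.

```python
import math
from statistics import mean, pstdev

# Weights from the raw_score_d7 formula (CPI enters with a negative sign).
WEIGHTS_D7 = {"ipm": 1.2, "roas_d7": 1.6, "ret_d7": 1.0, "cpi": -1.0}

def zscores(values):
    """Standardize a list. Population std is an assumption of this sketch."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

def lumina_d7(rows):
    """rows: dicts for ONE test window with positive ipm, roas_d7, ret_d7, cpi,
    plus precomputed p_installs and p_impressions penalty factors."""
    z = {k: zscores([math.log(r[k]) for r in rows]) for k in WEIGHTS_D7}
    raw = [sum(w * z[k][i] for k, w in WEIGHTS_D7.items())
           for i in range(len(rows))]
    penalized = [s * r["p_installs"] * r["p_impressions"]
                 for s, r in zip(raw, rows)]
    lo, hi = min(penalized), max(penalized)
    if hi == lo:
        return [0.0] * len(rows)  # degenerate window; assumed behavior
    # Per-window min-max rescale to 0..100.
    return [(p - lo) / (hi - lo) * 100 for p in penalized]
```

Because the rescale is per window, the best creative in a window always scores 100 and the worst 0, which is why scores are only comparable within the same test cohort.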

Formula change note. An earlier D7 formula included RET_D1 alongside RET_D7. We removed it because each horizon should use its own retention metric — using both double-counts retention and rewards same-day engagement at the D7 layer. Score impact: rank order is 95% preserved, average shift 3.2 points, max 12. All concept classes and tag aggregates in this report run against the corrected formula.

3.2 The Variance Ceiling — what's actually predictable

Quick clarification on terms:

Source: variance_breakdown_summary.csv + variance_decomposition.json (regenerated against corrected 4-component D7 formula; total D7 variance = 528.26). Eta-squared via ANOVA, grouping per-creative D7 scores by orion_concept (n = 789 D7-mature creative-window rows, 236 concepts).
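
For readers who want to check the eta-squared figure, here is a small stdlib-only sketch of the statistic: the between-group sum of squares divided by the total sum of squares. The function name and input shape are illustrative; the pipeline's own implementation is not shown in this report.

```python
from collections import defaultdict
from statistics import mean

def eta_squared(scores, groups):
    """One-way ANOVA effect size: SS_between / SS_total.

    scores: per-creative D7 scores; groups: the matching group label
    (for example the orion_concept) for each score.
    """
    grand = mean(scores)
    by_group = defaultdict(list)
    for s, g in zip(scores, groups):
        by_group[g].append(s)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(len(v) * (mean(v) - grand) ** 2
                     for v in by_group.values())
    return ss_between / ss_total if ss_total > 0 else 0.0
```

A value of 0.405 for concept means that 40.5% of the total sum of squares sits between concept means rather than within concepts.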

Why tags still matter even though they only explain 3% — the four layers

Layer 1 — The concept (40.5% of variance). The biggest single thing predicting whether a creative wins is which idea it belongs to. The framework treats this as the strategic lever: scale concepts that consistently win, kill concepts that consistently lose. This is the Creative Director's call, not the producer's.

Layer 2 — Within-concept differences (~59% combined). Once you fix the concept, what separates V1 from V5? About 3% from tag choice, 2% from producer, the rest is execution (the part the data can't see — script timing, edit rhythm, music drop, etc.). Tags don't dominate this layer because versions of the same concept usually share the same tags on purpose. That's how iteration works.

Layer 3: Tag patterns across the portfolio (3.3%). When we look across concepts, certain specific tag values systematically lift Lumina (the 3 Reliable Winners; see §5.3). 3.3% sounds small, but reading it as "tags don't matter" is misleading: most of the variance lives at the concept layer or in unmeasured execution, neither of which a producer controls. Tags are 100% of the operational lever a producer can directly turn. They're where the producer's decision lives.

Layer 4 — Why the 4 DNA tags are special. Of the 9 tag categories, 4 carry concept identity (gameplay_narrative_arc, hook, strategic_angle, creative_focus). Changing one of them is essentially trying a different concept. The other 5 (cta_strategy, gameplay_representation, pacing_energy, audio_narration, presenter_layer) vary execution without changing what the concept is. This split is what makes the engine work: hold DNA constant, vary peripherals → controlled experiments. Change DNA → cross-concept exploration.

Bottom line. Concept layer answers "which ideas to test." DNA-tag layer answers "what should the next idea look like." Peripheral-tag layer answers "how do we iterate safely on a working idea." The engine operates at all three. Treating any one as "the answer" misses the structure.

Within-concept version variance — where iteration pays off

A concept's average Lumina tells you whether the idea works on average. The within-concept standard deviation tells you whether iteration on its versions pays off — high std means the best version beats the worst by a lot, so testing more versions captures real upside. This is the explore-vs-exploit signal at the concept level. Each dot below is one concept (≥3 versions); marker size = number of versions.

Concept class | Definition | Operational read
CONSISTENT WINNER | High avg Lumina D7, low within-concept std. Most versions land above the median; the worst version isn't far below the best. | Scale this concept (more spend, more variants in production). Reliable but not teaching us anything new; versions are too similar to extract iteration learnings.
HIGH-CEILING BET | High avg Lumina D7, high within-concept std. Best version is much better than worst; iteration captures real upside. | Keep producing more versions; the marginal version has high expected value because the spread between V1 and V_best is wide.
AVERAGE PERFORMER | Mid avg, mid std. Neither obviously winning nor failing. | Hold; revisit if data sharpens. Not the best place to invest the next iteration cycle.
NEEDS ANALYSIS | Mixed signals; classification rules don't put it cleanly in any other bucket. | Manual review: check whether the concept is genuinely ambiguous or whether it has insufficient data (≥3 versions but unstable scoring).
LOTTERY EFFECT | Low avg Lumina D7 but high within-concept std. Most versions underperform; one or two outliers carry the average. | Don't double down on this concept; the wins are random, not replicable. Investigate whether the outlier had a unique execution that's worth porting elsewhere.
STABLE UNDERPERFORMER | Low avg Lumina D7, low within-concept std. Reliably bad. Versions cluster around a low ceiling. | Kill candidate. No iteration upside, no outlier hope. Reallocate the production budget.

Single-version + benchmark concepts get separate labels (Single Version Low/Average/High; BENCHMARK – Stable High / Variable High / Variable Average / Underperforming) and are excluded from the chart below because std is undefined or non-comparable.

Source: concept_level_analysis.csv (filtered to concepts with ≥3 versions, benchmarks excluded). Quadrant reading, with average Lumina D7 on the x-axis and within-concept std on the y-axis: top-right = HIGH-CEILING BETS; top-left = LOTTERY EFFECTS; bottom-right = CONSISTENT WINNERS; bottom-left = STABLE UNDERPERFORMERS. Median Lumina and median std reference lines split the quadrants.
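
The quadrant reading can be sketched as a median split. This reproduces only the chart's four quadrants; it is not the pipeline's concept_class rule, which also assigns AVERAGE PERFORMER and NEEDS ANALYSIS. The input shape and the tie rule (values equal to the median count as "high") are assumptions of this sketch.

```python
from statistics import median

QUADRANTS = {
    # (avg is high, std is high) -> quadrant label
    (True, True): "HIGH-CEILING BET",
    (True, False): "CONSISTENT WINNER",
    (False, True): "LOTTERY EFFECT",
    (False, False): "STABLE UNDERPERFORMER",
}

def quadrant_labels(concepts):
    """concepts: {name: (avg_d7, std_d7)} for concepts with >= 3 versions."""
    med_avg = median(a for a, _ in concepts.values())
    med_std = median(s for _, s in concepts.values())
    return {
        name: QUADRANTS[(avg >= med_avg, std >= med_std)]
        for name, (avg, std) in concepts.items()
    }
```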

3.3 Concept DNA — which tags carry concept identity

Mutual information between each tag category and orion_concept, computed on de-duplicated unique creatives. Higher MI = the tag does more work distinguishing one concept from another. The top 4 tags cover 61% of the total — these are the concept's structural DNA. The other 5 are peripheral execution choices: cosmetics that vary across versions of the same concept without changing what the concept fundamentally is.

Source: concept_dna_ranking.csv built via sklearn.feature_selection.mutual_info_classif on 998 de-duplicated creatives. Concept entropy = 7.54 bits.
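
To make the two quantities concrete, here is a stdlib-only sketch of discrete entropy and mutual information in bits. It is a plug-in estimator written for illustration, not the sklearn call the pipeline uses; to my knowledge sklearn reports mutual information in nats, so absolute values would differ by a constant factor while each tag's share of the total stays the same.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy of a list of categorical labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def mutual_info_bits(xs, ys):
    """Plug-in mutual information between two categorical lists, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

A tag that perfectly identified the concept would have mutual information equal to the concept entropy; a tag independent of concept would score zero.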

3.4 Reading guide

Sections 4–7 each cover one of the 4 levels. Within each section, charts come from the CSVs listed in the appendix. Every recommendation has an evidence tier; see Section 8 for definitions.

L1 4 · Descriptive — What happened?

4.1 Portfolio overview

The dataset spans 998 unique creatives across 236 concepts, 22 producers, and 24 test windows. Of these, 916 have a computed Lumina D7 score.

4.2 Weekly Lumina D7 trajectory

Source: weekly_lumina_d7_trajectory.csv. Lumina D7 hovers around 37–41 across the period, with no clear improvement trend over 23 windows. This is consistent with the "net learning ≈ zero" finding.

4.3 Per-tag-category performance — multi-metric panels

For each tag category, six side-by-side views in two rows. Row 1: (1) share of production — how often each value is used (currently equivalent to share of spend, since auto-test creatives receive equal spend); (2) average IPM — installs per 1000 impressions; (3) average CPI — cost per install (lower is better). Row 2: (4) average ROAS D7 — return on ad spend at day 7; (5) average Retention D7; (6) average Lumina D7 — the composite north-star. Bars colored green or blue when the value is in the favorable direction vs the production-weighted overall average; grey otherwise. Dotted line = overall average. CPI is colored inversely (lower is better).

DNA tags carry concept identity; peripheral tags are execution choices. Read the DNA tag charts as "if you change this, you're effectively testing a different concept"; read the peripheral tag charts as "safer within-concept tweaks."

[Figures: one six-panel chart per tag category. DNA tags: Strategic Angle, Hook, Creative Focus, Gameplay Narrative Arc. Peripheral tags: Gameplay Representation, Pacing Energy, Presenter Layer, Audio Narration, CTA Strategy.]

How to read these charts: A green bar = above the weighted overall average for that metric. Compare the share-of-production bar to the Lumina D7 bar — when they're misaligned, you've spotted a portfolio investment issue (over- or under-investment, see Level 4).

L2 5 · Diagnostic — Why did it happen?

5.1 Concept-class taxonomy

Every concept is classified into one of the categories below based on its mean Lumina, variance, sample size, and ceiling. This is the explore-vs-exploit map at the concept level.

Source: concept_level_analysis.csv column concept_class. 15 CONSISTENT WINNERS to scale, 19 HIGH-CEILING BETS to keep iterating, 42 STABLE UNDERPERFORMERS to kill.

Distribution of Lumina D7 within each concept class

The bar chart above shows class size; the boxplot below shows the actual distribution of per-creative Lumina D7 within each class. Read the spread (whiskers and outliers) to see how distinct the classes really are — a CONSISTENT WINNER's median ought to sit clearly above an AVERAGE PERFORMER's, and a LOTTERY EFFECT should have a wider IQR than a STABLE UNDERPERFORMER.

Source: creative_decision_facts.csv. Benchmark and single-version concepts excluded. Box = IQR, vertical line = median, dashed line = mean.

5.2 Variance decomposition

Layer | Variance explained (D7) | η² | Significance | Confounding
Concept (from variance_decomp) | 40.54% | 0.4054 | p<0.001 | Low
Producer (from variance_decomp) | 2.03% | 0.0203 | p<0.01 | High (V=0.31-0.62 with tags)
Strategic Angle (from variance_decomp) | 1.57% | 0.0157 | p<0.01 | High (V=0.47-0.59 with other tags)
gameplay_narrative_arc (individual) | 2.06% | 0.0206 | p<0.01 | Severe (V=0.33-0.55 vs tags, prod=0.47, window=0.45)
strategic_angle (individual) | 0.96% | 0.0096 | p<0.05 | Severe (V=0.39-0.59 vs tags, prod=0.51, window=0.29)
gameplay_representation (individual) | 0.51% | 0.0051 | p<0.01 | Severe (V=0.22-0.64 vs tags, prod=0.45, window=0.25)
hook (individual) | 0.47% | 0.0047 | p<0.01 | High (V=0.18-0.41 vs tags, prod=0.30, window=0.35)
audio_narration (individual) | 0.12% | 0.0012 | p<0.05 | Severe (V=0.30-0.62 vs tags, prod=0.62, window=0.30)
presenter_layer (individual) | 0.10% | 0.0010 | p<0.10 | Severe (V=0.27-0.67 vs tags, prod=0.60, window=0.32)
pacing_energy (individual) | 0.08% | 0.0008 | p<0.05 | Severe (V=0.18-0.62 vs tags, prod=0.54, window=0.25)
cta_strategy (individual) | 0.04% | 0.0004 | p=0.459 | Moderate (V=0.31-0.39 vs tags, prod=0.38, window=0.38)
creative_focus (individual) | 0.03% | 0.0003 | p<0.01 | Severe (V=0.29-0.67 vs tags, prod=0.49, window=0.29)
All 9 tags combined | 3.29% | 0.0329 | p<0.01 | Massive multicollinearity
Top 2 tags combined | 2.30% | 0.0230 | p<0.01 | High redundancy
Top 4 tags combined | 2.62% | 0.0262 | p<0.01 | Massive redundancy
Unexplained variance | 56.17% | n/a | n/a | n/a

Source: variance_breakdown_summary.csv. The "Concept" row groups creatives by orion_concept (e.g. C396, R1617) — the creative idea — not by individual creative_id. Confounding via Cramér's V on the categorical pairs.
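
As a reference for the Confounding column, the textbook form of Cramér's V can be sketched in a few lines of stdlib Python. This version applies no bias correction; whether the pipeline uses a corrected variant is not stated here, so treat the sketch as illustrative.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical lists (0 = independent,
    1 = one variable fully determines the other)."""
    n = len(xs)
    rows, cols, cells = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = 0.0
    for r, nr in rows.items():
        for c, nc in cols.items():
            expected = nr * nc / n
            chi2 += (cells.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
```

Values in the 0.5 to 0.6 range, as reported for several tag pairs, mean the two categorical variables move together strongly enough that their individual lifts cannot be read as independent effects.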

5.3 DNA tag execution difficulty

How the axes are computed (read this before the chart):

Bottom line: the top-right quadrant (high lift × high success rate, at Moderate difficulty or easier) is where evidence converges.

Reliable Winners: 3 DNA tag values clear all bars (p<0.01, success ≥50%, difficulty ≤ Moderate).
Tag | Lumina D7 lift | Success rate | Difficulty | Sample | Significance
gameplay_narrative_arc: Fail → Win | +24.5% | 73% | Easy | n=117 | Highly Significant (p<0.01)
hook: Problem Statement | +20.2% | 64% | Easy | n=81 | Highly Significant (p<0.01)
strategic_angle: Cognitive Challenge | +18.3% | 62% | Easy | n=71 | Highly Significant (p<0.01)

Source: dna_difficulty_matrix.csv joining tag_execution_difficulty.csv, tag_metric_correlations.csv, and producer_difficulty_analysis.csv.

5.4 Per-creative weakness diagnosis

Each D7-scored creative has a "weakest component" — the term in the Lumina D7 formula that contributed most negatively to its score. Distribution across 916 D7-scored creatives (4-component formula: IPM, ROAS_D7, RET_D7, −CPI):
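
A minimal sketch of the weakest-component lookup, assuming each component's contribution is its formula weight times the creative's within-window z(log metric). The function name and input dict are hypothetical; the persisted result lives in the weakest_component_d7 column.

```python
# Weights from the raw_score_d7 formula (CPI enters with a negative sign).
WEIGHTS_D7 = {"ipm": 1.2, "roas_d7": 1.6, "ret_d7": 1.0, "cpi": -1.0}

def weakest_component(z_logs):
    """z_logs: {component: z(log metric)} for one creative in its window.

    Returns the component with the most negative signed contribution,
    plus all contributions for inspection.
    """
    contributions = {k: w * z_logs[k] for k, w in WEIGHTS_D7.items()}
    weakest = min(contributions, key=contributions.get)
    return weakest, contributions
```

Note the sign handling for CPI: a below-average CPI has a negative z-score, which the −1.0 weight turns into a positive contribution.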

Composition split by concept class:

5.5 Metric × tag heatmap — the cross-section view

For each tag value (rows, n≥30 only), the lift % vs creatives without the tag, across all 8 KPIs (columns). Cells are red→green, centered at 0%; cells annotated with the lift value. ** = p<0.05, * = p<0.10. Read across a row to see whether a tag lifts everything or only one specific KPI; read down a column to see which tags are the strongest levers for a given KPI.

Source: tag_metric_correlations.csv filtered to sample_size ≥ 30. Note: lifts are univariate (one tag at a time, no concept fixed effects) so they double-count correlation between confounded tags. Use this view for hypothesis generation, not causal claims.

L3 6 · Prescriptive — What should we do?

6.1 The 4-component decomposition flow

This is the core diagnostic mechanism. Given an underperforming creative:

  1. Pull its row from creative_decision_facts.csv.
  2. Identify the most-negative component (the weakest_component_d7 column).
  3. Look up that KPI in lift_cube.csv → ranked tag values that lift it.
  4. Filter to evidence_tier ∈ {HIGH_CONFIDENCE, MEDIUM_CONFIDENCE}.
  5. For the recommended tag, check producer_tag_performance for the best executor.
  6. If the tag is Hard / Very Hard, route to a Hard Tag Specialist from producer_difficulty_analysis.
  7. Cross-check tag_swap_recommendations.csv for a within-concept evidence-backed swap (EXPLOIT path).
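
Steps 2 to 4 of the flow can be sketched as a filter-and-rank over lift rows. The column names weakest_component_d7 and evidence_tier come from the report; the keys kpi, lift_pct, and tag are hypothetical stand-ins for whatever lift_cube.csv actually calls them.

```python
ACTIONABLE_TIERS = {"HIGH_CONFIDENCE", "MEDIUM_CONFIDENCE"}

def explore_candidates(creative_row, lift_rows):
    """Rank positive, actionable lifts on the creative's weakest KPI.

    creative_row: one row of creative_decision_facts as a dict.
    lift_rows: rows of the lift cube as dicts (key names are assumed).
    """
    kpi = creative_row["weakest_component_d7"]
    hits = [
        r for r in lift_rows
        if r["kpi"] == kpi
        and r["evidence_tier"] in ACTIONABLE_TIERS
        and r["lift_pct"] > 0
    ]
    # Largest lift first, matching the "ranked tag values" in step 3.
    return sorted(hits, key=lambda r: r["lift_pct"], reverse=True)
```

Producer lookup and specialist routing (steps 5 and 6) would then run on each returned row.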

6.2 Top metric drivers — Lumina D7

Driver tag | Lift % | Sample | p-value | Significance
gameplay_narrative_arc: Fail → Win | +24.5% | 117 | 0.0000 | Highly Significant (p<0.01)
hook: Problem Statement | +20.2% | 81 | 0.0011 | Highly Significant (p<0.01)
strategic_angle: Cognitive Challenge | +18.3% | 71 | 0.0076 | Highly Significant (p<0.01)
gameplay_representation: Core Loop / Real-gameplay | +16.5% | 28 | 0.0742 | Marginally Significant (p<0.10)
hook: Immediate Action | +14.4% | 14 | 0.1947 | Not Significant (p≥0.10)

Source: metric_drivers.csv filtered to metric=lumina_score_d7.

6.3 Top within-concept tag swaps (EXPLOIT)

Concept | Tag category | Current | Recommended | +D7 Lumina | P(positive) | n versions
E34 | hook | Problem Statement | Social Proof | +65.5 | 100% | 2
C414 | strategic_angle | Progress & Achievement | Relax & Escape | +47.5 | 100% | 3
C519 | gameplay_narrative_arc | Fail → Win | Levels mix | +46.0 | 100% | 7
R1617 | hook | None (No Hook) | Surprise / Unexpected | +39.1 | 98% | 6
E44 | pacing_energy | Rising Tension | High-Energy / Fast | +37.4 | 100% | 6
E44 | gameplay_representation | Exaggerated / Aspirational gameplay | Conceptual | +37.4 | 100% | 6
C450 | gameplay_narrative_arc | Order → Chaos | Fail → Win | +35.1 | 98% | 12
C450 | cta_strategy | None (No CTA) | End-Card Only | +35.1 | 98% | 12

Source: tag_swap_recommendations.csv filtered to high-confidence rows. Note: only 31 of 169 concepts have a high-confidence swap available — the rest don't have enough version diversity for the swap algorithm to fire.

6.4 The engine on one creative — a worked example

Run suggest.py --creative <id> --window <id> on any creative and you get a full diagnostic. Here's the output for an actual underperformer in concept C504 (a CONSISTENT WINNER concept — meaning C504 versions normally do well, but this specific one didn't):

What to notice:
==============================================================================
CREATIVE: cmmon4dxm09ay0cpnm09lsu4b  |  Window: 19  |  Producer: Jeremy Laplatine
Concept:  C504  |  Class: CONSISTENT WINNER
          D7 Avg: 68.6 (p95), D7 Max: 78.4, D7 Std: 8.3, Versions: 5
------------------------------------------------------------------------------
Lumina D7: 57.4  (concept avg 68.6, -1.4σ vs concept peers)

DECOMPOSITION (sorted by drag, most negative first):
    Retention_D7           -0.95  ←  WEAKEST
    IPM                    +0.42
    ROAS_D7                +0.47
    CPI (lower better)     +1.06

EXPLOIT — within-concept tag swaps (n=0):
    [no high-confidence swaps available for this concept]

EXPLORE — cross-concept lifts on weakest KPI (ret_d7, n=4):
    • gameplay_narrative_arc: try 'Order → Chaos'
      +28.6% on retention_d7 (p=0.000, n=524, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
        → Best on this tag: Julie Droz (+19.0% D7 lift on this tag, n=45)
        → Best on this tag: Ayca Uyanik (+11.7% D7 lift on this tag, n=17)
    • strategic_angle: try 'Relax & Escape'
      +11.4% on retention_d7 (p=0.000, n=570, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
        → Best on this tag: Julie Droz (+15.6% D7 lift on this tag, n=51)
        → Best on this tag: Nikola Kachanski (+2.7% D7 lift on this tag, n=231)
    • hook: try 'Surprise / Unexpected'
      +2.8% on retention_d7 (p=0.019, n=284, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Moderate | Success rate: 47%
        → Best on this tag: Vira Bilous (+13.2% D7 lift on this tag, n=18)
        → Best on this tag: Nikola Kachanski (+2.5% D7 lift on this tag, n=104)
    • hook: try 'None (No Hook)'
      +1.7% on retention_d7 (p=0.032, n=525, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 51% ← HARD, route to specialist
        → Best on this tag: Nikola Kachanski (+5.2% D7 lift on this tag, n=144)
        → Best on this tag: Yevhenii Hrushetskyi (+2.6% D7 lift on this tag, n=34)
==============================================================================

6.5 From one-creative diagnostics to the concept brief

§6.4 is the engine analyzing a single creative. The biweekly deliverable works at a different grain — a concept brief, not a per-creative fix. Briefs ship for new concepts and new versions; nobody re-makes a creative that already shipped.

The bridge: the same logic (decomposition, EXPLOIT/EXPLORE, producer evidence) gets rolled up across the portfolio. For every concept the engine answers three questions:

Apply this to all D7-mature concepts and you get the cycle deliverable: Brief_Backlog_v1.html. One concept block looks like:

─────────────────────────────────────────────────────────────────────────────
C428                                                       [SCALE]  CONSISTENT WINNER
                          5 versions shipped · avg D7 = 53.7 · std = 12.9
─────────────────────────────────────────────────────────────────────────────

Rationale:
  C428 is a CONSISTENT WINNER. Goal: hold the proven DNA and systematically
  vary ONE peripheral tag per new version. The 3 versions below test the 3
  peripheral swaps with positive cross-concept evidence — generating data the
  concept currently lacks.

DNA (held constant):
  strategic_angle: Relax & Escape
  hook:            None (No Hook)
  creative_focus:  Gameplay Mechanic
  gameplay_narrative_arc: Order → Chaos

CONCEPT-LEVEL PRODUCER ROUTING:
  EXPLOIT  Producer-X +N% weighted (n on DNA tags, K/4 covered)
  EXPLORE  Producer-Y n=K on DNA tags (build capability)

Versions (3):
  V1 — None (Silent) audio                              H-2026-05-05-001
       Primary: audio_narration: Music/SFX → None (Silent)
       Expected D7 lift: +10.8%   [HIGH_CONFIDENCE, n=107]

  V2 — Core Loop / Real-gameplay representation         H-2026-05-05-002
       Primary: gameplay_representation: Exaggerated → Core Loop / Real-gameplay
       Expected D7 lift: +6.6%    [MEDIUM_CONFIDENCE, n=28]

  V3 — Always-On Banner CTA                             H-2026-05-05-003
       Primary: cta_strategy: End-Card Only → Always-On Banner
       Expected D7 lift: +2.3%    [MEDIUM_CONFIDENCE, n=46]
─────────────────────────────────────────────────────────────────────────────
Why these two artifacts together. §6.4 proves the engine can diagnose any creative — call it the runtime explainer; it answers "why did this specific creative underperform?" §6.5 / Brief Backlog proves the engine produces a brief-ready cycle deliverable — it answers "what should we ship next?" Both are the same explore-vs-exploit logic, just applied at different grains. The Creative Producer reads the Brief Backlog as a working document; suggest.py is available for ad-hoc analysis on any specific creative when needed.

Full operational deliverable: outputs_3/Brief_Backlog_v1.html — 13 active concepts (5 SCALE + 5 ITERATE + 3 EXPLORE) + 3 KILL recommendations, ~40 hypotheses total. The concept-level routing block aggregates each producer's track record across the concept's 4 DNA tags; the version-level prescriptions are the same defensible peripheral swaps applied across all SCALE/ITERATE concepts (this structural repetition is the data-generating mechanism for cycle-over-cycle learning).

6.6 How the engine builds the backlog

Read this if you need to defend a specific recommendation. The engine is not a learned model. It's four explicit rules running against CSVs that the upstream pipeline already produces. Every recommendation is reproducible from its inputs — no hidden weights, no opaque scoring. The full code is in outputs_3/build_brief_backlog.py.

Four steps

Step | What it does | Source / rule

  1. Concept selection
     What it does: Which concepts get a brief this cycle?
     Source / rule: SCALE = top N by avg D7 in CONSISTENT WINNER class. ITERATE = top N in HIGH-CEILING BET. EXPLORE = top N rows in underexplored_combinations.csv with is_underexplored=True. KILL = bottom N in STABLE UNDERPERFORMER. Source: concept_level_analysis.csv + underexplored_combinations.csv. Selection is fully dynamic, with no hardcoded lists. Every concept must clear observation_count ≥ MIN_VERSIONS to qualify (excludes single-version concepts where std is undefined).
  2. Version prescription
     What it does: What does each new version test?
     Source / rule: Hold the concept's 4 DNA tags constant. Generate one version per defensible peripheral swap. ITERATE concepts also get a 4th version if there's a high-confidence within-concept DNA swap available. Source: tag_metric_correlations.csv (peripheral filter) + tag_swap_recommendations.csv (DNA swap V4). Defensible swap rule: lift > 0 AND p-value ≤ P_VALUE_MAX AND n ≥ MIN_N_DEFENSIBLE_SWAP. Negative-expected-lift swaps are explicitly excluded.
  3. Producer routing
     What it does: Who's the best motion designer for this brief?
     Source / rule: For each concept, aggregate each producer's D7 lift on the concept's 4 DNA tags, weighted by their reps on each tag. EXPLOIT = top 2 by weighted lift. EXPLORE = bottom 2 (capability-building). Source: producer_tag_performance.csv. Producer must have ≥ MIN_PRODUCER_REPS on at least one of the concept's DNA tags to enter the ranking. Per-version Hard-tag flag fires from dna_difficulty_matrix.csv only when the version's tag is Hard or Very Hard.
  4. Hypothesis ID
     What it does: How does each recommendation get tracked?
     Source / rule: Each version recommendation receives a unique sequential ID: H-YYYY-MM-DD-NNN. The Director's brief references the ID, the Creative Producer logs the implemented concept name in the tracker, and the next pipeline cycle's loop-closure script computes predicted-vs-actual. Bookkeeping: the tracker schema is the persistent ledger; the cycle_log.csv records each cycle's inputs (data hash) and outputs (artifacts).
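
The producer-routing step can be sketched as a reps-weighted average of lifts over the concept's DNA tags. This is an illustration of the rule as described, not the code in build_brief_backlog.py: the input shape, the function name, and the choice to keep EXPLORE disjoint from EXPLOIT when few producers qualify are all assumptions.

```python
MIN_PRODUCER_REPS = 5  # current value of the knob named in step 3

def route_producers(perf, dna_tags, top_k=2):
    """perf: {producer: {tag_value: (d7_lift_pct, reps)}}.
    dna_tags: the concept's DNA tag values."""
    ranked = []
    for producer, by_tag in perf.items():
        covered = [by_tag[t] for t in dna_tags if t in by_tag]
        # Eligibility: enough reps on at least one of the concept's DNA tags.
        if not any(reps >= MIN_PRODUCER_REPS for _, reps in covered):
            continue
        total_reps = sum(reps for _, reps in covered)
        weighted = sum(lift * reps for lift, reps in covered) / total_reps
        ranked.append((producer, weighted))
    ranked.sort(key=lambda pr: pr[1], reverse=True)
    # EXPLOIT = top_k by weighted lift; EXPLORE = bottom top_k of the rest.
    return {"EXPLOIT": ranked[:top_k], "EXPLORE": ranked[top_k:][-top_k:]}
```

Weighting by reps means a producer with one lucky result on a tag cannot outrank someone with a long, moderately positive track record on the same tags.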

Tunable knobs

Every threshold that controls the backlog lives at the top of build_brief_backlog.py. Change a number, regenerate, ship. These are not statistical thresholds — they're operational levers the team can adjust as capacity or evidence appetite changes.

Knob | Current value | What it controls
N_SCALE | 5 | How many CONSISTENT WINNER concepts get SCALE briefs each cycle
N_ITERATE | 5 | How many HIGH-CEILING BETS get ITERATE briefs
N_EXPLORE | 3 | How many under-explored DNA recipes get tested
N_KILL | 3 | How many STABLE UNDERPERFORMERs get deprioritized
MIN_VERSIONS | 3 | Minimum versions a concept must have to qualify (excludes singleton concepts)
MIN_RECIPE_MATURE | 2 | Minimum D7-mature creatives a recipe must have to be EXPLORE-eligible
P_VALUE_MAX | 0.10 | Max p-value for a peripheral swap to be "defensible"
MIN_N_DEFENSIBLE_SWAP | 15 | Min sample size for a peripheral swap
MIN_LIFT_DEFENSIBLE_SWAP | 0 | Min Lumina D7 lift % (>0 = positive only)
MIN_PRODUCER_REPS | 5 | Min reps a producer needs on a DNA tag to enter ranking
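
To show how three of these knobs combine, here is a sketch of the defensible-swap test using the current values from the table. The function name and argument names are hypothetical; only the thresholds and the rule itself come from this report.

```python
# Current knob values, copied from the table above.
P_VALUE_MAX = 0.10
MIN_N_DEFENSIBLE_SWAP = 15
MIN_LIFT_DEFENSIBLE_SWAP = 0

def is_defensible_swap(lift_pct, p_value, n):
    """True when a peripheral swap has positive lift, a small enough
    p-value, and a large enough sample to be prescribed in a brief."""
    return (
        lift_pct > MIN_LIFT_DEFENSIBLE_SWAP
        and p_value <= P_VALUE_MAX
        and n >= MIN_N_DEFENSIBLE_SWAP
    )
```

Tightening P_VALUE_MAX or raising MIN_N_DEFENSIBLE_SWAP shrinks the set of swaps the backlog can prescribe, which is the trade-off between evidence appetite and the number of versions per concept.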

What the engine does NOT do

Honest about the limitations

Two known weaknesses worth stating up front:

Every cycle's run prints exact selections to stdout: concepts picked, defensible swaps qualified, CV quantiles used. That's the audit trail.

The hypothesis tracker — how the engine learns

The engine's recommendations are inputs to a feedback loop, not endpoints. Each version recommendation gets a unique ID (H-2026-05-05-001, etc.) that the Creative Producer references when shipping the brief. The Producer marks in the tracker which target concept name was assigned to each hypothesis. Two cycles later, when that concept's D7 data has matured, the engine automatically computes predicted-vs-actual and tags the hypothesis as HIT, MISS, or MIXED.
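
The ID scheme itself is simple enough to sketch. This is a hypothetical helper, assuming the counter restarts at 001 for each cycle date; the tracker's actual numbering logic is not shown in this report.

```python
from datetime import date

def hypothesis_ids(cycle_date, count, start=1):
    """Build sequential IDs of the form H-YYYY-MM-DD-NNN for one cycle."""
    return [
        f"H-{cycle_date.isoformat()}-{i:03d}"
        for i in range(start, start + count)
    ]
```

For example, hypothesis_ids(date(2026, 5, 5), 3) yields the three IDs used in the C428 block in §6.5.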

This is what turns a one-shot analysis into a learning system. Over 5–6 cycles we'll know:

The output of the engine improves cycle over cycle because the tracker measures it. Outputs that hit get reinforced; outputs that miss get downweighted or dropped. The tracker is the mechanism that closes recommendation_outcomes.csv from §9 and graduates the engine from "best-evidence guidance" toward "evidence-validated guidance." This is the system we are explicitly building toward.

L4 7 · Strategic — Where should we invest?

7.1 Production concentration — OVER vs UNDER-invested tags

For each tag value, the chart shows how its mean Lumina D7 differs from the overall D7 average across the 825 D7-eligible non-benchmark creatives. Red = OVER-INVESTED (high volume + below-average D7 performance — these are the tags consuming most of your production capacity but pulling your average down; the rebalance candidates). Green = WELL-INVESTED. Orange = UNDER-INVESTED (low volume + above-average performance — scale candidates).

The rebalance story: the 5 most over-invested tags are the high-volume staples (strategic_angle: Relax & Escape at 67% of production with -1.2% lift, hook: Surprise / Unexpected at 34% with -2.2%, gameplay_narrative_arc: Order → Chaos at 62% with -2.2%, and so on). Meanwhile the 3 most under-invested tags are Cognitive Challenge, Problem Statement, and Core Loop / Real-gameplay. The first two are Reliable Winners from §5.3; Core Loop / Real-gameplay is a peripheral tag whose lift is only marginally significant (+16.5%, n=28, p=0.074; see §6.2). This is the cleanest reallocation signal in the report.

Source: outputs_3/production_concentration_d7.csv — the Lumina D7 version, with benchmark creatives and "No Tag Applied" rows excluded (benchmarks recur across windows and would inflate volume artificially). Built by outputs_3/build_production_concentration_d7.py. NEUTRAL-status tags (medium-volume) are hidden in this chart for clarity; see the CSV for the full list.

7.2 Underexplored DNA recipes — the EXPLORE frontier

9-tag recipes flagged is_underexplored=True by the pipeline (recipe appears far less often than expected) AND with at least one D7-mature creative. Sorted by avg Lumina D7. These are direct EXPLORE candidates.

Strategic Angle | Hook | Creative Focus | Narrative Arc | n | Avg Lumina D7 | D7 percentile
Relax & Escape | None (No Hook) | Gameplay Mechanic | Order → Chaos | 3 | 73.0 | 100%
Destruction & Chaos | Social Proof | Gameplay Mechanic | None (No Arc) | 4 | 66.4 | 99%
Relax & Escape | None (No Hook) | Gameplay Mechanic | Order → Chaos | 4 | 66.3 | 97%
Explore/Discovery | Surprise / Unexpected | Juicy Effects | Levels mix | 3 | 61.0 | 90%
Relax & Escape | Social Proof | Narrative Moment | Steady Win | 3 | 59.2 | 88%
Progress & Achievement | Problem Statement | Juicy Effects | Fail → Win | 3 | 57.2 | 86%
Progress & Achievement | Surprise / Unexpected | Gameplay Mechanic | Fail → Win | 4 | 56.7 | 83%
Progress & Achievement | Unrelated Disruptor | Gameplay Mechanic | Steady Win | 3 | 50.5 | 72%

Source: underexplored_combinations.csv. Sample sizes are intentionally small (n=3–5) — that's the definition of underexplored.

7.3 Producer overview & specialist routing

Producer | n creatives | Avg Lumina D7 | Classification | p-value vs peers
Yevhenii Hrushetskyi | 69 | 48.7 | Above Average | 0.267
Vira Bilous | 25 | 46.7 | Above Average | 0.627
Julie Droz | 62 | 48.5 | Above Average | 0.431
Alexandre de Crozals | 13 | 47.1 | Above Average | 0.629
Nikola Kachanski | 348 | 46.2 | Average | 0.224
Jeremy Laplatine | 224 | 42.4 | Average | 0.071
Ayca Uyanik | 24 | 41.1 | Below Average | 0.308
Ana Narchemashvili | 14 | 37.5 | High Variance / High Ceiling | 0.279

Hard Tag Specialists (sorted by lift on hard tags vs peers):

| Producer | Specialization | Avg Lumina D7 (hard tags) | Lift vs peers | Sample |
|---|---|---|---|---|
| Julie Droz | Hard Tag Specialist (limited easy data) | 48.4 | +8.8% | 543 |
| Alexandre de Crozals | Hard Tag Specialist (limited easy data) | 47.5 | +8.7% | 97 |
| Yevhenii Hrushetskyi | Hard Tag Specialist (limited easy data) | 48.5 | +6.7% | 609 |
| Vira Bilous | Insufficient Data | 47.0 | +4.0% | 219 |
| Nikola Kachanski | Insufficient Data | 46.2 | +1.6% | 3105 |
| Jeremy Laplatine | Insufficient Data | 42.5 | -5.9% | 1979 |
| Ayca Uyanik | Insufficient Data | 41.2 | -7.7% | 212 |
| Ana Narchemashvili | Insufficient Data | 32.2 | -23.0% | 94 |

8 · The Suggestion Engine — Architecture

The engine sits on top of three new tables in outputs_3/:

  1. creative_decision_facts.csv — atomic per-(creative, window) table with the 4-component Lumina D7 decomposition (IPM, ROAS_D7, RET_D7, CPI). Joins concept-class. 1,078 rows.
  2. lift_cube.csv — (tag × KPI × producer) lookup. 523 rows × 29 cols. Pre-computed evidence tier (HIGH/MEDIUM/EXPLORATORY/INSUFFICIENT_DATA).
  3. dna_difficulty_matrix.csv — DNA tag values × difficulty × success rate × specialist routing. 27 rows × 19 cols.

Plus the orchestrator: suggest.py. Run as:

$ python3 outputs_3/suggest.py --creative cmidfpdzz02k5kz0cwmel029n --window 5
$ python3 outputs_3/suggest.py --worst-d7 5    # 5 worst D7 creatives
$ python3 outputs_3/suggest.py --concept R1617   # all creatives in concept

Evidence tiers (used everywhere in the engine)

| Tier | Rule | Use for |
|---|---|---|
| HIGH_CONFIDENCE | p < 0.05 AND n ≥ 30 | Direct recommendations, action plans |
| MEDIUM_CONFIDENCE | p < 0.10 AND n ≥ 15 | Hypothesis generation, A/B tests |
| EXPLORATORY | weaker | Discussion, qualitative reads |
| INSUFFICIENT_DATA | missing p-value or n | Skip |
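
The tier rules above translate directly into a small classifier. This is a sketch, not the engine's actual code: the function name evidence_tier is hypothetical, and reading "weaker" as "both values present but neither threshold pair met" is an assumption.

```python
import math


def evidence_tier(p_value, n):
    """Map a (p-value, sample size) pair to an evidence tier.

    Thresholds follow the tier rules in this section; treating NaN like a
    missing p-value is an assumption.
    """
    if p_value is None or n is None or math.isnan(p_value):
        return "INSUFFICIENT_DATA"
    if p_value < 0.05 and n >= 30:
        return "HIGH_CONFIDENCE"
    if p_value < 0.10 and n >= 15:
        return "MEDIUM_CONFIDENCE"
    return "EXPLORATORY"  # assumed meaning of "weaker"


# Illustrative inputs only.
for p, n in [(0.01, 100), (0.03, 20), (0.07, 200), (0.30, 60), (None, 10)]:
    print(p, n, evidence_tier(p, n))
```

Note that a small p-value alone is not enough: p = 0.03 with n = 20 lands in MEDIUM_CONFIDENCE because it misses the n ≥ 30 requirement.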

9 · Data Quality & Limitations

Honesty matters more than confidence. The items below are what this analysis cannot do today and the things in the data that warrant skepticism. Stating them up front is the only way recommendations downstream are trustworthy.
The long-run intent. Each item below is a known limitation, not an accepted compromise. The biweekly cycle is designed to address them over time. The framework's value isn't its current accuracy; it's that every cycle measurably narrows the gap.

9.1 What this analysis does NOT claim

9.2 Sample-size flags by analysis

| Analysis | n | Caveat |
|---|---|---|
| Variance decomposition | 789–916 D7 rows | Healthy, well-powered. |
| tag_metric_correlations | varies by tag (some <15) | Use evidence_tier to filter. |
| tag_swap_recommendations | 2–10 versions per swap | Most CIs span zero; filter is_high_confidence=1. |
| Producer × tag lifts | often n=10–25 | Only 8 producers total; peer baseline is small. |
| Concept fixed effects | not computed | Major upgrade opportunity. |
| Inheritance learning | 14% ancestor coverage | "Net learning ≈ zero" finding is suspended-judgment. |

9.3 Known data hygiene issues

9.4 Confounding warnings (Cramér's V)

10 · Roadmap & Open Questions

10.1 Highest-leverage upgrades

  1. Add concept fixed effects to tag_metric_correlations.csv and marginal_effects.csv. This will shrink most "tag lift" estimates and surface the few that survive once concept is controlled for; those are the real ones.
  2. Close the recommendation_outcomes loop. Track which suggestions producers acted on and what happened. Without this we can't measure engine quality.
  3. Increase ancestor coverage for inheritance_learning_analysis. Currently 14% — push to ≥50% to make iteration claims trustworthy.
  4. Build per-creative explainability beyond the 4-component decomposition. Currently we say "IPM is low"; we don't say "IPM is low because the hook plays at 1.4 sec instead of 0.8."
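
To make the first upgrade concrete, here is a minimal sketch of a concept fixed-effects estimate for a single tag, using within-concept demeaning (numerically equivalent to a regression with one dummy per concept). It is not the pipeline's method, which the report says is not yet computed; the function names and the (concept_id, has_tag, lumina_d7) row layout are hypothetical, and the data is a toy example.

```python
from collections import defaultdict


def naive_effect(rows):
    """Raw difference in mean score between tagged and untagged creatives."""
    tagged = [y for _, x, y in rows if x == 1]
    untagged = [y for _, x, y in rows if x == 0]
    return sum(tagged) / len(tagged) - sum(untagged) / len(untagged)


def within_concept_effect(rows):
    """Fixed-effects (within) slope of score on a 0/1 tag indicator.

    rows: iterable of (concept_id, has_tag, lumina_d7).
    Returns None when no concept mixes tagged and untagged creatives.
    """
    by_concept = defaultdict(list)
    for concept, has_tag, score in rows:
        by_concept[concept].append((float(has_tag), float(score)))

    num = den = 0.0
    for pairs in by_concept.values():
        x_mean = sum(x for x, _ in pairs) / len(pairs)
        y_mean = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            # Deviations from the concept mean remove the concept's own level.
            num += (x - x_mean) * (y - y_mean)
            den += (x - x_mean) ** 2
    return num / den if den > 0 else None


# Toy data: concept "A" is strong and mostly tagged, "B" is weak and mostly untagged.
rows = [
    ("A", 1, 62.0), ("A", 1, 62.0), ("A", 0, 60.0),
    ("B", 1, 42.0), ("B", 0, 40.0), ("B", 0, 40.0),
]
print(round(naive_effect(rows), 2), round(within_concept_effect(rows), 2))
```

On this toy data the naive gap is about 8.7 Lumina points while the within-concept effect is 2.0: most of the apparent lift comes from the tag being used more often in the stronger concept, which is the pattern the upgrade is meant to expose.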

10.2 Open questions for the team

11 · Appendix — CSV Reference

Every file used in this report, where it lives, and which level uses it:

| File | Path | Level | Purpose |
|---|---|---|---|
| creative_level_analysis.csv | outputs_test_v2/ | L1, L3 | Atomic creative-window facts. Source of truth for Lumina D7 + components. |
| creative_decision_facts.csv | outputs_3/ | L3 (engine) | Per-creative Lumina decomposition + concept class join. |
| lift_cube.csv | outputs_3/ | L3 (engine) | (tag × KPI × producer) → lift lookup with evidence tier. |
| concept_dna_ranking.csv | outputs_3/ | L2 | Mutual-info ranking of which tag categories define concept identity. |
| dna_difficulty_matrix.csv | outputs_3/ | L2, L3 | DNA tag values × difficulty × success rate × specialist routing. |
| variance_breakdown_summary.csv | outputs_test_v2/tag_combinations/ | L2 | What % of D7 variance each layer explains. |
| variance_decomposition.json | outputs_test_v2/tag_combinations/producer_analytics/ | L2 | ANOVA eta² for concept / producer / strategic_angle. |
| concept_level_analysis.csv | outputs_test_v2/tag_combinations/ | L2 | Concept-class taxonomy (CONSISTENT WINNER, HIGH-CEILING BET...) |
| tag_metric_correlations.csv | outputs_test_v2/tag_combinations/ | L1, L3 | (tag × KPI) lift table with p-values, used for chart bars. |
| metric_drivers.csv | outputs_test_v2/tag_combinations/ | L3 | Top-5 tag drivers per metric (IPM, ROAS, retention...). |
| tag_execution_difficulty.csv | outputs_test_v2/tag_combinations/ | L2 | Per-tag difficulty tier + success rate + ceiling/floor. |
| tag_swap_recommendations.csv | outputs_test_v2/tag_combinations/ | L3 | Within-concept tag swaps with 90% CIs. |
| underexplored_combinations.csv | outputs_test_v2/tag_combinations/ | L4 | 9-tag recipes with low n + high Lumina percentile. |
| production_concentration_d7.csv | outputs_3/ | L4 | Lumina D7 version, benchmarks excluded. Replaces the upstream D0 file. |
| production_concentration.csv | outputs_test_v2/tag_combinations/ | L4 (legacy) | Original D0 version; superseded for §7.1 by the D7 file above. |
| weekly_lumina_d7_trajectory.csv | outputs_test_v2/tag_combinations/ | L1, L4 | Portfolio-wide Lumina trend, WoW change. |
| producer_overview.csv | outputs_test_v2/tag_combinations/producer_analytics/ | L2, L4 | Producer composite_score + p_value vs peers. |
| producer_difficulty_analysis.csv | outputs_test_v2/tag_combinations/producer_analytics/ | L3, L4 | Hard Tag Specialists (Julie, Alexandre). |
| insights_tag_*.csv | outputs_test_v2/ | L1 | Per-tag-category aggregates (avg_ipm, avg_lumina_d7, n). |