Creative Intelligence — Creative Learnings & Production Iteration (tag system)

Andres Mendoza · Growth Data Analyst (Central Hybrid Team) · Analyst Edition

Analytical scope: Facebook Auto-test creatives
North-star metric: Lumina Score D7
Methodology: 4-level descriptive→strategic framework
Generated: from outputs_test_v2/ + outputs_3/

Audience: data analysts and data scientists. This is the canonical reference report — every claim is anchored to a CSV, every number is reproducible, every limitation is stated. A separate business-facing version will be derived from this.

Contents

Executive Summary
Business Problem & Approach
Framework & Methodology
LEVEL 1 — Descriptive
LEVEL 2 — Diagnostic
LEVEL 3 — Prescriptive
LEVEL 4 — Strategic
The Suggestion Engine
Data Quality & Limitations
Roadmap & Open Questions
Appendix: CSV Reference

1 · Executive Summary

Unique creatives: 998
D7-scored: 916
Concepts: 236
Test windows: 24
Producers: 22
High-confidence suggestions
DNA tag categories: 4
Reliable DNA winners: 3

The headline: Concept (the creative idea) explains 40.5% of D7 Lumina variance. All 9 tags combined explain 3.3%. Concept choice is ~12× more predictive than tag choice. The framework treats concept selection as the strategic lever and tags as the operational one — they answer different questions.

Six headline findings

Concept choice dominates. 40.5% of variance comes from the concept (idea). 3.3% from all 9 tags. ~56% is unexplained (execution + noise). Tags are still the only knob a producer directly turns — see §3.2.
3 specific tag values reliably lift Lumina at p<0.01: gameplay_narrative_arc: Fail → Win (+24.5%, n=117), hook: Problem Statement (+20.2%, n=81), strategic_angle: Cognitive Challenge (+18.3%, n=71). Everything else is noise or worse.
Lumina D7 has 4 components: IPM, ROAS_D7, RET_D7, CPI (inverted). For any underperforming creative we can isolate which component is dragging the score — that's the diagnostic foundation for every recommendation.
4 of the 9 tag categories define the concept's identity (gameplay_narrative_arc, hook, strategic_angle, creative_focus). They carry 61% of the information distinguishing one concept from another. The other 5 are peripheral — execution choices that vary across versions of the same concept.
Hard Tag Specialists exist: Alexandre de Crozals (+14.0% on hard tags, n=48) and Julie Droz (+10.9%, n=131). For Hard-difficulty tags the engine routes to them.
Iteration is not currently improving creatives. 98 descendants beat their ancestors; 104 underperformed them. Roughly even. Caveat: only 14% of ancestors are in the dataset, so this isn't conclusive.

2 · Business Problem & Approach

Business Problem

Creative is the #1 lever for UA efficiency, yet creative decisions remain largely intuition-based.
High production volume, low signal: hundreds of variations without systematic feedback.
No feedback loop: creative teams don't know which elements to repeat, scale, or avoid.
Performance data exists but is disconnected: Lumina tracks metrics (IPM, CPI, ROAS, Retention) but there's no link to creative attributes.
No systematic way to diagnose why a creative underperformed.

Core Thesis

Creative performance is not random. Specific tag values consistently lift Lumina, and we can name them. The system uses these patterns to inform briefs — it's a guidance tool, not a forecasting model. Tags explain ~3% of total variance, so a recommendation is "best evidence we have," not a prediction.

The 4 Layers of Analysis

Level	Question it answers	Anchor outputs
L1 Descriptive	What happened?	`creative_level_analysis`, `insights_tag_*`, `tag_metric_correlations`, `weekly_lumina_d7_trajectory`
L2 Diagnostic	Why did it happen?	`concept_level_analysis`, `variance_breakdown_summary`, `tag_execution_difficulty`, `concept_dna_ranking`, `dna_difficulty_matrix`
L3 Prescriptive	What should we do?	`tag_swap_recommendations`, `metric_drivers`, `creative_decision_facts`, `lift_cube`, `suggest.py`
L4 Strategic	Where should we invest?	`production_concentration`, `underexplored_combinations`, `weekly_lumina_d7_trajectory`, `producer_difficulty_analysis`

3 · Framework & Methodology

3.1 The Lumina Formulas (D0 and D7)

Lumina is a composite score. D0 measures same-day performance (≤24h after the creative ran); D7 measures mature-cohort performance (≥7 days). Each horizon uses metrics from its own time window — D7 does not borrow D1 retention, because doing so would double-count retention and reward same-day engagement at the D7 layer.

raw_score_d0 = 1.2·z(log IPM) + 1.6·z(log ROAS_D0) + 1.0·z(log RET_D1) − 1.0·z(log CPI)
raw_score_d7 = 1.2·z(log IPM) + 1.6·z(log ROAS_D7) + 1.0·z(log RET_D7) − 1.0·z(log CPI)

penalized = raw_score × p_installs × p_impressions
lumina_score = (penalized − min) / (max − min) × 100 (per-window rescale for both D0 and D7 — Lumina compares creatives within the same test cohort)

Where:

z(·) = z-score normalization within test window (both D0 and D7) — every component is standardized against its window cohort.
p_installs = 0.5 if installs < threshold (D0=10, D7=25) else 1.0; p_impressions = 0.8 if impressions < 2000 else 1.0.
D7 mature = test window ended ≥7 days ago AND roas_d7 + retention_d7 are both non-null.
Final score range = 0 to 100, rounded to 2 decimals. Higher is better.
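As a minimal sketch of how the definitions above compose for a single D7 test window (not the production pipeline): the function name `lumina_d7`, the row keys, the use of population standard deviation, and the fallback for a degenerate window are all assumptions, and every metric is assumed strictly positive so the logs are defined.

```python
import math
from statistics import mean, pstdev

# Weights from the raw_score_d7 formula; CPI enters with a negative sign.
WEIGHTS_D7 = {"ipm": 1.2, "roas_d7": 1.6, "ret_d7": 1.0, "cpi": -1.0}


def zscores(values):
    """Standardize values against their own window (population std: an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def lumina_d7(rows, min_installs=25, min_impressions=2000):
    """Score every creative in ONE test window; returns scores in row order."""
    z = {m: zscores([math.log(r[m]) for r in rows]) for m in WEIGHTS_D7}
    penalized = []
    for i, r in enumerate(rows):
        raw = sum(w * z[m][i] for m, w in WEIGHTS_D7.items())
        p_installs = 0.5 if r["installs"] < min_installs else 1.0
        p_impressions = 0.8 if r["impressions"] < min_impressions else 1.0
        penalized.append(raw * p_installs * p_impressions)
    lo, hi = min(penalized), max(penalized)
    if hi == lo:  # degenerate window: rescale is undefined (assumed fallback)
        return [0.0] * len(rows)
    return [round((p - lo) / (hi - lo) * 100, 2) for p in penalized]


window = [  # hypothetical creatives, for illustration only
    {"ipm": 10, "roas_d7": 0.5, "ret_d7": 0.20, "cpi": 1.0, "installs": 100, "impressions": 5000},
    {"ipm": 5, "roas_d7": 0.3, "ret_d7": 0.10, "cpi": 2.0, "installs": 100, "impressions": 5000},
    {"ipm": 7, "roas_d7": 0.4, "ret_d7": 0.15, "cpi": 1.5, "installs": 100, "impressions": 5000},
]
scores = lumina_d7(window)
```

One property worth knowing when reading scores: because the penalty multiplies the raw score, it shrinks the score toward zero, so for a creative whose raw score is already negative the penalty raises it rather than lowers it.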

Verification: the persisted lumina_score_d7 in creative_level_analysis.csv matches independent recompute to 0.0000 absolute difference across all 916 D7-mature rows.

Formula change note. An earlier D7 formula included RET_D1 alongside RET_D7. We removed it because each horizon should use its own retention metric — using both double-counts retention and rewards same-day engagement at the D7 layer. Score impact: rank order is 95% preserved, average shift 3.2 points, max 12. All concept classes and tag aggregates in this report run against the corrected formula.

3.2 The Variance Ceiling — what's actually predictable

Quick clarification on terms:

A creative is one row: one specific version of one concept (e.g. C444_V4_WW) in one test window.
A concept is the underlying idea: C444, R1617, etc. We have 998 creatives across 236 concepts — about 4 versions per concept on average.
When we say "concept explains 40.5%" we mean: knowing the concept (the idea) predicts 40.5% of the spread between creatives within a test window. It does not mean knowing the specific creative_id.

Source: variance_breakdown_summary.csv + variance_decomposition.json (regenerated against corrected 4-component D7 formula; total D7 variance = 528.26). Eta-squared via ANOVA, grouping per-creative D7 scores by orion_concept (n = 789 D7-mature creative-window rows, 236 concepts).
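For readers who want to reproduce the headline share, eta-squared is the between-group sum of squares divided by the total sum of squares. The sketch below shows that calculation on toy data; the function name and the input shape (two parallel lists) are assumptions, not the layout of variance_decomposition.json.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(scores, groups):
    """Share of total variance in `scores` explained by group membership."""
    grand_mean = mean(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    # Between-group sum of squares: group size times squared distance
    # of the group mean from the grand mean.
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in by_group.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical Lumina D7 scores for two concepts, illustration only.
toy_scores = [10, 12, 30, 32]
toy_concepts = ["C_A", "C_A", "C_B", "C_B"]
share = eta_squared(toy_scores, toy_concepts)
```

Eta-squared is an in-sample descriptive share. With many groups that each hold only a few rows it tends to run high, which is worth keeping in mind when comparing the concept figure against the tag figures.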

Why tags still matter even though they only explain 3% — the four layers

Layer 1 — The concept (40.5% of variance). The biggest single thing predicting whether a creative wins is which idea it belongs to. The framework treats this as the strategic lever: scale concepts that consistently win, kill concepts that consistently lose. This is the Creative Director's call, not the producer's.

Layer 2 — Within-concept differences (~59% combined). Once you fix the concept, what separates V1 from V5? About 3% from tag choice, 2% from producer, the rest is execution (the part the data can't see — script timing, edit rhythm, music drop, etc.). Tags don't dominate this layer because versions of the same concept usually share the same tags on purpose. That's how iteration works.

Layer 3 — Tag patterns across the portfolio (3.3%). When we look across concepts, certain specific tag values systematically lift Lumina (the 3 Reliable Winners; see §5.3). 3.3% sounds small, but it's misleading to read it as "tags don't matter": most of the variance sits at the concept layer (40.5%) or in unmeasured execution and noise (~56%), and a producer controls neither. Tags are the only lever a producer can turn directly, so this is where the producer's decision lives.

Layer 4 — Why the 4 DNA tags are special. Of the 9 tag categories, 4 carry concept identity (gameplay_narrative_arc, hook, strategic_angle, creative_focus). Changing one of them is essentially trying a different concept. The other 5 (cta_strategy, gameplay_representation, pacing_energy, audio_narration, presenter_layer) vary execution without changing what the concept is. This split is what makes the engine work: hold DNA constant, vary peripherals → controlled experiments. Change DNA → cross-concept exploration.

Bottom line. Concept layer answers "which ideas to test." DNA-tag layer answers "what should the next idea look like." Peripheral-tag layer answers "how do we iterate safely on a working idea." The engine operates at all three. Treating any one as "the answer" misses the structure.

Within-concept version variance — where iteration pays off

A concept's average Lumina tells you whether the idea works on average. The within-concept standard deviation tells you whether iteration on its versions pays off — high std means the best version beats the worst by a lot, so testing more versions captures real upside. This is the explore-vs-exploit signal at the concept level. Each dot below is one concept (≥3 versions); marker size = number of versions.

Concept class	Definition	Operational read
CONSISTENT WINNER	High avg Lumina D7, low within-concept std. Most versions land above the median; the worst version isn't far below the best.	Scale this concept (more spend, more variants in production). Reliable but not teaching us anything new — versions are too similar to extract iteration learnings.
HIGH-CEILING BET	High avg Lumina D7, high within-concept std. Best version is much better than worst — iteration captures real upside.	Keep producing more versions; the marginal version has high expected value because the spread between V1 and V_best is wide.
AVERAGE PERFORMER	Mid avg, mid std. Neither obviously winning nor failing.	Hold; revisit if data sharpens. Not the best place to invest the next iteration cycle.
NEEDS ANALYSIS	Mixed signals; classification rules don't put it cleanly in any other bucket.	Manual review — check whether the concept is genuinely ambiguous or whether it has insufficient data (≥3 versions but unstable scoring).
LOTTERY EFFECT	Low avg Lumina D7 but high within-concept std. Most versions underperform; one or two outliers carry the average.	Don't double down on this concept — the wins are random, not replicable. Investigate whether the outlier had a unique execution that's worth porting elsewhere.
STABLE UNDERPERFORMER	Low avg Lumina D7, low within-concept std. Reliably bad. Versions cluster around a low ceiling.	Kill candidate. No iteration upside, no outlier hope. Reallocate the production budget.
Single-version + benchmark concepts get separate labels (Single Version Low/Average/High; BENCHMARK – Stable High / Variable High / Variable Average / Underperforming) and are excluded from the chart below because std is undefined or non-comparable.

Source: concept_level_analysis.csv (filtered to concepts with ≥3 versions, benchmarks excluded). Quadrant reading, with average Lumina D7 on the horizontal axis and within-concept std on the vertical axis: top-right = HIGH-CEILING BETS; top-left = LOTTERY EFFECTS; bottom-right = CONSISTENT WINNERS; bottom-left = STABLE UNDERPERFORMERS. Median Lumina and median std reference lines split the quadrants.
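The quadrant reading can be sketched as a simple median split. This is a deliberate simplification for reading the chart: the persisted concept_class also uses sample size and ceiling and has AVERAGE PERFORMER and NEEDS ANALYSIS buckets that a two-way split cannot produce. The function names and the input shape are assumptions.

```python
from statistics import median


def quadrant(avg_d7, std_d7, median_avg, median_std):
    """Map one concept to a chart quadrant using the median reference lines."""
    high_avg = avg_d7 >= median_avg
    high_std = std_d7 >= median_std
    if high_avg and high_std:
        return "HIGH-CEILING BET"
    if high_avg:
        return "CONSISTENT WINNER"
    if high_std:
        return "LOTTERY EFFECT"
    return "STABLE UNDERPERFORMER"


def label_quadrants(concepts):
    """concepts: dict of concept name -> (avg Lumina D7, within-concept std)."""
    median_avg = median([avg for avg, _ in concepts.values()])
    median_std = median([std for _, std in concepts.values()])
    return {
        name: quadrant(avg, std, median_avg, median_std)
        for name, (avg, std) in concepts.items()
    }


# Hypothetical concepts, illustration only.
labels = label_quadrants(
    {"K1": (70, 5), "K2": (65, 20), "K3": (30, 4), "K4": (35, 18)}
)
```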

3.3 Concept DNA — which tags carry concept identity

Mutual information between each tag category and orion_concept, computed on de-duplicated unique creatives. Higher MI = the tag does more work distinguishing one concept from another. The top 4 tags cover 61% of the total — these are the concept's structural DNA. The other 5 are peripheral execution choices: cosmetics that vary across versions of the same concept without changing what the concept fundamentally is.

Source: concept_dna_ranking.csv built via sklearn.feature_selection.mutual_info_classif on 998 de-duplicated creatives. Concept entropy = 7.54 bits.
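To make the ranking concrete, here is a minimal plug-in estimate of mutual information between one tag column and the concept label, in bits, plus the share-of-total calculation behind the 61% figure. It is a standard-library stand-in, not the sklearn call used to build concept_dna_ranking.csv (mutual_info_classif reports nats, and its exact settings in the pipeline are not shown here); all function names are assumptions.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy of a categorical column, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information_bits(xs, ys):
    """Plug-in mutual information between two categorical columns, in bits."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def mi_shares(tag_columns, concepts):
    """Each tag category's share of the summed MI, largest first."""
    mi = {tag: mutual_information_bits(col, concepts) for tag, col in tag_columns.items()}
    total = sum(mi.values())
    ranked = sorted(mi.items(), key=lambda kv: kv[1], reverse=True)
    return {tag: (value / total if total else 0.0) for tag, value in ranked}


# Hypothetical creatives: `hook` tracks the concept exactly, `cta_strategy` not at all.
toy_concepts = ["C1", "C1", "C2", "C2"]
toy_tags = {
    "hook": ["Problem Statement", "Problem Statement", "Social Proof", "Social Proof"],
    "cta_strategy": ["End-Card Only", "Always-On Banner", "End-Card Only", "Always-On Banner"],
}
shares = mi_shares(toy_tags, toy_concepts)
```

In the toy data the hook column carries all of the information about the concept and the CTA column none, which is the DNA-versus-peripheral distinction in miniature.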

3.4 Reading guide

Each subsequent section is one of the 4 levels. Within each section, charts come from the CSVs listed in the appendix. Every recommendation has an evidence tier — see Section 9 for definitions.

4 · LEVEL 1 Descriptive: What happened?

4.1 Portfolio overview

The dataset spans 998 unique creatives across 236 concepts, 22 producers, and 24 test windows. Of these, 916 have a computed Lumina D7 score.

4.2 Weekly Lumina D7 trajectory

Source: weekly_lumina_d7_trajectory.csv. Lumina D7 hovers in the ~37–41 range across the period, with no clear improvement trend over 23 windows. This is consistent with the "net learning ≈ zero" finding.

4.3 Per-tag-category performance — multi-metric panels

For each tag category, six side-by-side views in two rows. Row 1: (1) share of production — how often each value is used (currently equivalent to share of spend, since auto-test creatives receive equal spend); (2) average IPM — installs per 1000 impressions; (3) average CPI — cost per install (lower is better). Row 2: (4) average ROAS D7 — return on ad spend at day 7; (5) average Retention D7; (6) average Lumina D7 — the composite north-star. Bars colored green or blue when the value is in the favorable direction vs the production-weighted overall average; grey otherwise. Dotted line = overall average. CPI is colored inversely (lower is better).

DNA tags carry concept identity; peripheral tags are execution choices. Read the DNA tag charts as "if you change this, you're effectively testing a different concept"; read the peripheral tag charts as "safer within-concept tweaks."

Panels (one six-view chart per tag category):
DNA tags: Strategic Angle, Hook, Creative Focus, Gameplay Narrative Arc
Peripheral tags: Gameplay Representation, Pacing Energy, Presenter Layer, Audio Narration, CTA Strategy

How to read these charts: A green bar = above the weighted overall average for that metric. Compare the share-of-production bar to the Lumina D7 bar — when they're misaligned, you've spotted a portfolio investment issue (over- or under-investment, see Level 4).

5 · LEVEL 2 Diagnostic: Why did it happen?

5.1 Concept-class taxonomy

Every concept is classified into one of the categories below based on its mean Lumina, variance, sample size, and ceiling. This is the explore-vs-exploit map at the concept level.

Source: concept_level_analysis.csv column concept_class. 15 CONSISTENT WINNERS to scale, 19 HIGH-CEILING BETS to keep iterating, 42 STABLE UNDERPERFORMERS to kill.

Distribution of Lumina D7 within each concept class

The bar chart above shows class size; the boxplot below shows the actual distribution of per-creative Lumina D7 within each class. Read the spread (whiskers and outliers) to see how distinct the classes really are — a CONSISTENT WINNER's median ought to sit clearly above an AVERAGE PERFORMER's, and a LOTTERY EFFECT should have a wider IQR than a STABLE UNDERPERFORMER.

Source: creative_decision_facts.csv. Benchmark and single-version concepts excluded. Box = IQR, vertical line = median, dashed line = mean.

5.2 Variance decomposition

Layer	Variance explained (D7)	R²	Significance	Confounding
Concept (from variance_decomp)	40.54%	0.4054	p<0.001	Low
Producer (from variance_decomp)	2.03%	0.0203	p<0.01	High (V=0.31-0.62 with tags)
Strategic Angle (from variance_decomp)	1.57%	0.0157	p<0.01	High (V=0.47-0.59 with other tags)
gameplay_narrative_arc (individual)	2.06%	0.0206	p<0.01	Severe (V=0.33-0.55 vs tags, prod=0.47, window=0.45)
strategic_angle (individual)	0.96%	0.0096	p<0.05	Severe (V=0.39-0.59 vs tags, prod=0.51, window=0.29)
gameplay_representation (individual)	0.51%	0.0051	p<0.01	Severe (V=0.22-0.64 vs tags, prod=0.45, window=0.25)
hook (individual)	0.47%	0.0047	p<0.01	High (V=0.18-0.41 vs tags, prod=0.30, window=0.35)
audio_narration (individual)	0.12%	0.0012	p<0.05	Severe (V=0.30-0.62 vs tags, prod=0.62, window=0.30)
presenter_layer (individual)	0.10%	0.0010	p<0.10	Severe (V=0.27-0.67 vs tags, prod=0.60, window=0.32)
pacing_energy (individual)	0.08%	0.0008	p<0.05	Severe (V=0.18-0.62 vs tags, prod=0.54, window=0.25)
cta_strategy (individual)	0.04%	0.0004	p=0.459	Moderate (V=0.31-0.39 vs tags, prod=0.38, window=0.38)
creative_focus (individual)	0.03%	0.0003	p<0.01	Severe (V=0.29-0.67 vs tags, prod=0.49, window=0.29)
All 9 tags combined	3.29%	0.0329	p<0.01	Massive multicollinearity
Top 2 tags combined	2.30%	0.0230	p<0.01	High redundancy
Top 4 tags combined	2.62%	0.0262	p<0.01	Massive redundancy
Unexplained variance	56.17%	n/a	n/a	n/a

Source: variance_breakdown_summary.csv. The "Concept" row groups creatives by orion_concept (e.g. C396, R1617) — the creative idea — not by individual creative_id. Confounding via Cramér's V on the categorical pairs.

5.3 DNA tag execution difficulty

How the axes are computed (read this before the chart):

X-axis — Lumina D7 lift % (with-tag vs without-tag): for each tag value, compare the mean Lumina D7 of creatives that have the tag against creatives that do not have it. Lift = (mean_with − mean_without) / |mean_without| × 100. Source column: tag_metric_correlations.lift_pct filtered to metric=lumina_score_d7. Caveat: not concept-controlled — some lift may be concept confounding rather than the tag's own effect.
Y-axis — Success rate D7 (%): proportion of creatives carrying the tag whose Lumina D7 lands above a "successful creative" threshold (defined per-tag by the production pipeline using its overall D7 distribution). Source column: tag_execution_difficulty.success_rate_d7. A tag with success_rate=0.70 means 70% of the creatives that used it were classified as successful; 30% under-performed.
Color — Execution difficulty (1–4 score): derived in tag_execution_difficulty.csv from within-tag std + range spread + sample-size adjustment. Easy/Moderate = the tag delivers consistently; Hard/Very Hard = wide variance, success depends heavily on execution quality. Lower difficulty does NOT mean better — it means more predictable.
Marker size: sample size (creatives carrying this tag). Tag values with n<10 are hidden to suppress the noisiest tail.
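The two chart axes reduce to a few lines of arithmetic. The sketch below applies the lift formula exactly as stated and treats the success threshold as a plain input, since the pipeline's per-tag threshold rule is not reproduced here; function names and the toy scores are assumptions.

```python
from statistics import mean


def lift_pct(scores_with, scores_without):
    """Lift = (mean_with - mean_without) / |mean_without| * 100."""
    mean_with, mean_without = mean(scores_with), mean(scores_without)
    return (mean_with - mean_without) / abs(mean_without) * 100


def success_rate(scores_with, threshold):
    """Share of tagged creatives whose Lumina D7 lands above the threshold."""
    return sum(score > threshold for score in scores_with) / len(scores_with)


# Hypothetical Lumina D7 scores, illustration only.
with_tag = [30, 50, 60, 60]
without_tag = [40, 40, 40, 40]
lift = lift_pct(with_tag, without_tag)      # mean 50 vs mean 40
rate = success_rate(with_tag, threshold=45)  # 3 of 4 above 45
```

Because the comparison group is every creative without the tag, the lift inherits whatever concept mix sits on each side of the split, which is the confounding caveat noted for the X-axis.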

Bottom line: the top-right quadrant (high lift × high success rate, at Easy or Moderate difficulty) is where evidence converges.

Reliable Winners — 3 DNA tag values clear all bars (p<0.01, success ≥50%, difficulty ≤ Moderate):

Tag	Lumina D7 lift	Success rate	Difficulty	Sample	Significance
gameplay_narrative_arc: Fail → Win	+24.5%	73%	Easy	n=117	Highly Significant (p<0.01)
hook: Problem Statement	+20.2%	64%	Easy	n=81	Highly Significant (p<0.01)
strategic_angle: Cognitive Challenge	+18.3%	62%	Easy	n=71	Highly Significant (p<0.01)

Source: dna_difficulty_matrix.csv joining tag_execution_difficulty.csv, tag_metric_correlations.csv, and producer_difficulty_analysis.csv.

5.4 Per-creative weakness diagnosis

Each D7-scored creative has a "weakest component" — the term in the Lumina D7 formula that contributed most negatively to its score. Distribution across 916 D7-scored creatives (4-component formula: IPM, ROAS_D7, RET_D7, −CPI):

ROAS_D7 drags 311 creatives (34%)
Retention_D7 drags 272 (30%)
IPM drags 203 (22%)
CPI drags 130 (14%)
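A sketch of how a weakest component can be assigned and tallied, assuming "contributed most negatively" means the smallest weighted term of the D7 formula (weight × z-score, with CPI's negative weight applied). The z-scores below are hypothetical, and the function name is an assumption rather than the pipeline's.

```python
from collections import Counter

# Weights from the raw_score_d7 formula; CPI is inverted via its negative weight.
WEIGHTS_D7 = {"IPM": 1.2, "ROAS_D7": 1.6, "Retention_D7": 1.0, "CPI": -1.0}


def weakest_component(z_scores):
    """Return (name of the most negative weighted term, all weighted terms)."""
    contributions = {name: w * z_scores[name] for name, w in WEIGHTS_D7.items()}
    weakest = min(contributions, key=contributions.get)
    return weakest, contributions


# Hypothetical per-creative z-scores (already standardized within window).
creatives = [
    {"IPM": 0.35, "ROAS_D7": 0.30, "Retention_D7": -0.95, "CPI": -1.06},
    {"IPM": -0.80, "ROAS_D7": 0.10, "Retention_D7": 0.20, "CPI": 0.10},
    {"IPM": 0.50, "ROAS_D7": -0.90, "Retention_D7": -0.40, "CPI": -0.30},
]
tally = Counter(weakest_component(z)[0] for z in creatives)
```

Note that a cheap creative (negative CPI z-score) produces a positive CPI term, so CPI only becomes the weakest component when the creative is expensive relative to its window.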

Composition split by concept class:

5.5 Metric × tag heatmap — the cross-section view

For each tag value (rows, n≥30 only), the lift % vs creatives without the tag, across all 8 KPIs (columns). Cells are red→green, centered at 0%; cells annotated with the lift value. ** = p<0.05, * = p<0.10. Read across a row to see whether a tag lifts everything or only one specific KPI; read down a column to see which tags are the strongest levers for a given KPI.

Source: tag_metric_correlations.csv filtered to sample_size ≥ 30. Note: lifts are univariate (one tag at a time, no concept fixed effects) so they double-count correlation between confounded tags. Use this view for hypothesis generation, not causal claims.

6 · LEVEL 3 Prescriptive: What should we do?

6.1 The 4-component decomposition flow

This is the core diagnostic mechanism. Given an underperforming creative:

1. Pull its row from creative_decision_facts.csv.
2. Identify the most-negative component (the weakest_component_d7 column).
3. Look up that KPI in lift_cube.csv to get the ranked tag values that lift it.
4. Filter to evidence_tier ∈ {HIGH_CONFIDENCE, MEDIUM_CONFIDENCE}.
5. For the recommended tag, check producer_tag_performance.csv for the best executor.
6. If the tag is Hard / Very Hard, route to a Hard Tag Specialist from producer_difficulty_analysis.csv.
7. Cross-check tag_swap_recommendations.csv for a within-concept evidence-backed swap (EXPLOIT path).
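The lookup, tier filter, and Hard-tag routing steps of this flow can be sketched on in-memory rows. This is an illustration of the logic, not suggest.py itself: the row keys `metric`, `tag`, and `lift_pct` and the function name are assumptions (only `weakest_component_d7` and `evidence_tier` are named in this report), and the LOW_CONFIDENCE and ipm rows are invented to show the filters working.

```python
ACCEPTED_TIERS = {"HIGH_CONFIDENCE", "MEDIUM_CONFIDENCE"}
HARD_LEVELS = {"Hard", "Very Hard"}


def explore_candidates(creative, lift_cube, difficulty):
    """Rank cross-concept tag moves that lift the creative's weakest KPI."""
    kpi = creative["weakest_component_d7"]
    hits = [
        row for row in lift_cube
        if row["metric"] == kpi
        and row["lift_pct"] > 0
        and row["evidence_tier"] in ACCEPTED_TIERS
    ]
    hits.sort(key=lambda row: row["lift_pct"], reverse=True)
    return [
        {
            **row,
            "difficulty": difficulty.get(row["tag"], "Unknown"),
            # Hard / Very Hard tags get flagged for a Hard Tag Specialist.
            "route_to_specialist": difficulty.get(row["tag"]) in HARD_LEVELS,
        }
        for row in hits
    ]


creative = {"weakest_component_d7": "retention_d7"}
lift_cube = [  # first three rows echo the worked example in this report
    {"metric": "retention_d7", "tag": "hook: Surprise / Unexpected", "lift_pct": 2.8, "evidence_tier": "HIGH_CONFIDENCE"},
    {"metric": "retention_d7", "tag": "gameplay_narrative_arc: Order → Chaos", "lift_pct": 28.6, "evidence_tier": "HIGH_CONFIDENCE"},
    {"metric": "retention_d7", "tag": "strategic_angle: Relax & Escape", "lift_pct": 11.4, "evidence_tier": "HIGH_CONFIDENCE"},
    {"metric": "retention_d7", "tag": "pacing_energy: Rising Tension", "lift_pct": 40.0, "evidence_tier": "LOW_CONFIDENCE"},
    {"metric": "ipm", "tag": "hook: Immediate Action", "lift_pct": 15.0, "evidence_tier": "HIGH_CONFIDENCE"},
]
difficulty = {
    "gameplay_narrative_arc: Order → Chaos": "Hard",
    "strategic_angle: Relax & Escape": "Hard",
    "hook: Surprise / Unexpected": "Moderate",
}
candidates = explore_candidates(creative, lift_cube, difficulty)
```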

6.2 Top metric drivers — Lumina D7

Driver tag	Lift %	Sample	p-value	Significance
gameplay_narrative_arc: Fail → Win	+24.5%	117	<0.0001	Highly Significant (p<0.01)
hook: Problem Statement	+20.2%	81	0.0011	Highly Significant (p<0.01)
strategic_angle: Cognitive Challenge	+18.3%	71	0.0076	Highly Significant (p<0.01)
gameplay_representation: Core Loop / Real-gameplay	+16.5%	28	0.0742	Marginally Significant (p<0.10)
hook: Immediate Action	+14.4%	14	0.1947	Not Significant (p≥0.10)

Source: metric_drivers.csv filtered to metric=lumina_score_d7.

6.3 Top within-concept tag swaps (EXPLOIT)

Concept	Tag category	Current	Recommended	+D7 Lumina	P(positive)	n versions
E34	hook	Problem Statement	Social Proof	+65.5	100%	2
C414	strategic_angle	Progress & Achievement	Relax & Escape	+47.5	100%	3
C519	gameplay_narrative_arc	Fail → Win	Levels mix	+46.0	100%	7
R1617	hook	None (No Hook)	Surprise / Unexpected	+39.1	98%	6
E44	pacing_energy	Rising Tension	High-Energy / Fast	+37.4	100%	6
E44	gameplay_representation	Exaggerated / Aspirational gameplay	Conceptual	+37.4	100%	6
C450	gameplay_narrative_arc	Order → Chaos	Fail → Win	+35.1	98%	12
C450	cta_strategy	None (No CTA)	End-Card Only	+35.1	98%	12

Source: tag_swap_recommendations.csv filtered to high-confidence rows. Note: only 31 of 169 concepts have a high-confidence swap available — the rest don't have enough version diversity for the swap algorithm to fire.

6.4 The engine on one creative — a worked example

Run suggest.py --creative <id> --window <id> on any creative and you get a full diagnostic. The output reproduced after the notes below is for an actual underperformer in concept C504 (a CONSISTENT WINNER concept, meaning C504 versions normally do well, but this specific one didn't).

What to notice:

Lumina = 57.4 sounds fine in absolute terms, but it sits 1.4σ below the rest of C504's versions (reported as −1.4σ in the output). The score is honest about underperformance even when it's not a disaster.
Decomposition isolates the problem: Retention D7 is dragging the score. The other 3 components are positive. So the question is "how do we fix retention," not "why is this creative bad."
EXPLOIT path is empty. The engine doesn't pretend to have a within-concept fix when there isn't one — C504 doesn't have enough variation across its 5 versions for the swap algorithm to have evidence. This is a real and useful "no answer" answer.
EXPLORE path offers 4 cross-concept moves on retention_d7. All carry DNA warnings — they'd effectively turn C504 into a sibling concept. The Director sees this honestly and decides whether the upside is worth changing the concept.
Producer routing punchline: for the strongest cross-concept option (Order → Chaos, +28.6% retention_d7), Julie Droz has the best track record on that tag (+19% D7 lift, n=45). The engine surfaced both the recommendation and who can execute it.

==============================================================================
CREATIVE: cmmon4dxm09ay0cpnm09lsu4b  |  Window: 19  |  Producer: Jeremy Laplatine
Concept:  C504  |  Class: CONSISTENT WINNER
          D7 Avg: 68.6 (p95), D7 Max: 78.4, D7 Std: 8.3, Versions: 5
------------------------------------------------------------------------------
Lumina D7: 57.4  (concept avg 68.6, -1.4σ vs concept peers)

DECOMPOSITION (sorted by drag, most negative first):
    Retention_D7           -0.95  ←  WEAKEST
    IPM                    +0.42
    ROAS_D7                +0.47
    CPI (lower better)     +1.06

EXPLOIT — within-concept tag swaps (n=0):
    [no high-confidence swaps available for this concept]

EXPLORE — cross-concept lifts on weakest KPI (ret_d7, n=4):
    • gameplay_narrative_arc: try 'Order → Chaos'
      +28.6% on retention_d7 (p=0.000, n=524, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
        → Best on this tag: Julie Droz (+19.0% D7 lift on this tag, n=45)
        → Best on this tag: Ayca Uyanik (+11.7% D7 lift on this tag, n=17)
    • strategic_angle: try 'Relax & Escape'
      +11.4% on retention_d7 (p=0.000, n=570, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 48% ← HARD, route to specialist
        → Best on this tag: Julie Droz (+15.6% D7 lift on this tag, n=51)
        → Best on this tag: Nikola Kachanski (+2.7% D7 lift on this tag, n=231)
    • hook: try 'Surprise / Unexpected'
      +2.8% on retention_d7 (p=0.019, n=284, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Moderate | Success rate: 47%
        → Best on this tag: Vira Bilous (+13.2% D7 lift on this tag, n=18)
        → Best on this tag: Nikola Kachanski (+2.5% D7 lift on this tag, n=104)
    • hook: try 'None (No Hook)'
      +1.7% on retention_d7 (p=0.032, n=525, HIGH_CONFIDENCE)  [⚠ DNA tag]
      Difficulty: Hard | Success rate: 51% ← HARD, route to specialist
        → Best on this tag: Nikola Kachanski (+5.2% D7 lift on this tag, n=144)
        → Best on this tag: Yevhenii Hrushetskyi (+2.6% D7 lift on this tag, n=34)
==============================================================================

6.5 From one-creative diagnostics to the concept brief

§6.4 is the engine analyzing a single creative. The biweekly deliverable works at a different grain — a concept brief, not a per-creative fix. Briefs ship for new concepts and new versions; nobody re-makes a creative that already shipped.

The bridge: the same logic (decomposition, EXPLOIT/EXPLORE, producer evidence) gets rolled up across the portfolio. For every concept the engine answers three questions:

What action? SCALE, ITERATE, EXPLORE, or KILL.
Which peripheral tags should we vary in the next versions? Hold the 4-tag DNA constant; vary one peripheral per version.
Who should execute it? The producer with the best track record on the concept's DNA combination, plus a capability-building option.

Apply this to all D7-mature concepts and you get the cycle deliverable: Brief_Backlog_v1.html. One concept block looks like:

─────────────────────────────────────────────────────────────────────────────
C428                                                       [SCALE]  CONSISTENT WINNER
                          5 versions shipped · avg D7 = 53.7 · std = 12.9
─────────────────────────────────────────────────────────────────────────────

Rationale:
  C428 is a CONSISTENT WINNER. Goal: hold the proven DNA and systematically
  vary ONE peripheral tag per new version. The 3 versions below test the 3
  peripheral swaps with positive cross-concept evidence — generating data the
  concept currently lacks.

DNA (held constant):
  strategic_angle: Relax & Escape
  hook:            None (No Hook)
  creative_focus:  Gameplay Mechanic
  gameplay_narrative_arc: Order → Chaos

CONCEPT-LEVEL PRODUCER ROUTING:
  EXPLOIT  Producer-X +N% weighted (n on DNA tags, K/4 covered)
  EXPLORE  Producer-Y n=K on DNA tags (build capability)

Versions (3):
  V1 — None (Silent) audio                              H-2026-05-05-001
       Primary: audio_narration: Music/SFX → None (Silent)
       Expected D7 lift: +10.8%   [HIGH_CONFIDENCE, n=107]

  V2 — Core Loop / Real-gameplay representation         H-2026-05-05-002
       Primary: gameplay_representation: Exaggerated → Core Loop / Real-gameplay
       Expected D7 lift: +6.6%    [MEDIUM_CONFIDENCE, n=28]

  V3 — Always-On Banner CTA                             H-2026-05-05-003
       Primary: cta_strategy: End-Card Only → Always-On Banner
       Expected D7 lift: +2.3%    [MEDIUM_CONFIDENCE, n=46]
─────────────────────────────────────────────────────────────────────────────

Why these two artifacts together. §6.4 proves the engine can diagnose any creative — call it the runtime explainer; it answers "why did this specific creative underperform?" §6.5 / Brief Backlog proves the engine produces a brief-ready cycle deliverable — it answers "what should we ship next?" Both are the same explore-vs-exploit logic, just applied at different grains. The Creative Producer reads the Brief Backlog as a working document; suggest.py is available for ad-hoc analysis on any specific creative when needed.

Full operational deliverable: outputs_3/Brief_Backlog_v1.html — 13 active concepts (5 SCALE + 5 ITERATE + 3 EXPLORE) + 3 KILL recommendations, ~40 hypotheses total. The concept-level routing block aggregates each producer's track record across the concept's 4 DNA tags; the version-level prescriptions are the same defensible peripheral swaps applied across all SCALE/ITERATE concepts (this structural repetition is the data-generating mechanism for cycle-over-cycle learning).

6.6 How the engine builds the backlog

Read this if you need to defend a specific recommendation. The engine is not a learned model. It's four explicit rules running against CSVs that the upstream pipeline already produces. Every recommendation is reproducible from its inputs — no hidden weights, no opaque scoring. The full code is in outputs_3/build_brief_backlog.py.

Four steps

Step	What it does	Source / rule
1. Concept selection: Which concepts get a brief this cycle?	SCALE = top N by avg D7 in CONSISTENT WINNER class. ITERATE = top N in HIGH-CEILING BET. EXPLORE = top N rows in `underexplored_combinations.csv` with `is_underexplored=True`. KILL = bottom N in STABLE UNDERPERFORMER.	Source: `concept_level_analysis.csv` + `underexplored_combinations.csv`. Selection is fully dynamic, with no hardcoded lists. Every concept must clear `observation_count ≥ MIN_VERSIONS` to qualify (at the current value of 3 this excludes one- and two-version concepts; std is undefined for a single version).
2. Version prescription: What does each new version test?	Hold the concept's 4 DNA tags constant. Generate one version per defensible peripheral swap. ITERATE concepts also get a 4th version if there's a high-confidence within-concept DNA swap available.	Source: `tag_metric_correlations.csv` (peripheral filter) + `tag_swap_recommendations.csv` (DNA swap V4). Defensible swap rule: lift > 0 AND p-value ≤ `P_VALUE_MAX` AND n ≥ `MIN_N_DEFENSIBLE_SWAP`. Negative-expected-lift swaps are explicitly excluded.
3. Producer routing: Who's the best motion designer for this brief?	For each concept, aggregate each producer's D7 lift on the concept's 4 DNA tags, weighted by their reps on each tag. EXPLOIT = top 2 by weighted lift. EXPLORE = bottom 2 (capability-building).	Source: `producer_tag_performance.csv`. Producer must have ≥ `MIN_PRODUCER_REPS` on at least one of the concept's DNA tags to enter the ranking. Per-version Hard-tag flag fires from `dna_difficulty_matrix.csv` only when the version's tag is Hard or Very Hard.
4. Hypothesis ID: How does each recommendation get tracked?	Each version recommendation receives a unique sequential ID: `H-YYYY-MM-DD-NNN`. The Director's brief references the ID, the Creative Producer logs the implemented concept name in the tracker, and the next pipeline cycle's loop-closure script computes predicted-vs-actual.	Bookkeeping. The tracker schema is the persistent ledger; the `cycle_log.csv` records each cycle's inputs (data hash) and outputs (artifacts).
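The producer-routing rule in step 3 is a reps-weighted average followed by a top/bottom cut. The sketch below is one reading of that rule, not the code in build_brief_backlog.py. Two points are assumptions because the table leaves them open: every DNA-tag row of an eligible producer enters the average (including rows below the reps minimum), and EXPLORE never repeats a name already in EXPLOIT. Row keys, producer names, and numbers are hypothetical.

```python
from collections import defaultdict

MIN_PRODUCER_REPS = 5


def route_producers(dna_tags, perf_rows, min_reps=MIN_PRODUCER_REPS, k=2):
    """Return (EXPLOIT names, EXPLORE names, weighted lift per eligible producer)."""
    lift_times_reps = defaultdict(float)
    total_reps = defaultdict(int)
    max_reps = defaultdict(int)
    for row in perf_rows:
        if row["tag"] not in dna_tags:
            continue  # only the concept's own DNA tags count
        producer = row["producer"]
        lift_times_reps[producer] += row["lift_pct"] * row["n"]
        total_reps[producer] += row["n"]
        max_reps[producer] = max(max_reps[producer], row["n"])
    ranking = sorted(
        (
            (p, lift_times_reps[p] / total_reps[p])
            for p in total_reps
            if max_reps[p] >= min_reps  # eligibility: enough reps on at least one DNA tag
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
    names = [p for p, _ in ranking]
    exploit = names[:k]
    explore = [p for p in names[-k:] if p not in exploit]
    return exploit, explore, dict(ranking)


dna = {"hook: None (No Hook)", "strategic_angle: Relax & Escape"}
rows = [  # hypothetical track records
    {"producer": "Producer-A", "tag": "hook: None (No Hook)", "lift_pct": 20.0, "n": 10},
    {"producer": "Producer-A", "tag": "strategic_angle: Relax & Escape", "lift_pct": 10.0, "n": 10},
    {"producer": "Producer-A", "tag": "hook: Problem Statement", "lift_pct": 50.0, "n": 40},
    {"producer": "Producer-B", "tag": "hook: None (No Hook)", "lift_pct": 5.0, "n": 20},
    {"producer": "Producer-C", "tag": "hook: None (No Hook)", "lift_pct": -4.0, "n": 6},
    {"producer": "Producer-D", "tag": "hook: None (No Hook)", "lift_pct": 30.0, "n": 2},
    {"producer": "Producer-E", "tag": "strategic_angle: Relax & Escape", "lift_pct": 8.0, "n": 5},
]
exploit, explore, weighted = route_producers(dna, rows)
```

Producer-D shows why the reps minimum matters: a +30% lift on two creatives would otherwise top the ranking.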

Tunable knobs

Every threshold that controls the backlog lives at the top of build_brief_backlog.py. Change a number, regenerate, ship. These are not statistical thresholds — they're operational levers the team can adjust as capacity or evidence appetite changes.

Knob	Current value	What it controls
`N_SCALE`	5	How many CONSISTENT WINNER concepts get SCALE briefs each cycle
`N_ITERATE`	5	How many HIGH-CEILING BETS get ITERATE briefs
`N_EXPLORE`	3	How many under-explored DNA recipes get tested
`N_KILL`	3	How many STABLE UNDERPERFORMERs get deprioritized
`MIN_VERSIONS`	3	Minimum versions a concept must have to qualify (excludes singleton concepts)
`MIN_RECIPE_MATURE`	2	Minimum D7-mature creatives a recipe must have to be EXPLORE-eligible
`P_VALUE_MAX`	0.10	Max p-value for a peripheral swap to be "defensible"
`MIN_N_DEFENSIBLE_SWAP`	15	Min sample size for a peripheral swap
`MIN_LIFT_DEFENSIBLE_SWAP`	0	Lumina D7 lift % a swap must exceed (0 = positive lifts only)
`MIN_PRODUCER_REPS`	5	Min reps a producer needs on a DNA tag to enter ranking
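The three defensible-swap knobs combine into a single predicate. The sketch below uses the knob names and current values from the table; the function name `is_defensible_swap`, the dictionary keys, and the candidate rows are hypothetical and are not taken from build_brief_backlog.py.

```python
P_VALUE_MAX = 0.10
MIN_N_DEFENSIBLE_SWAP = 15
MIN_LIFT_DEFENSIBLE_SWAP = 0


def is_defensible_swap(lift_pct, p_value, n):
    """True when a peripheral swap clears all three operational thresholds."""
    if lift_pct is None or p_value is None or n is None:
        return False  # missing evidence is never defensible
    return (
        lift_pct > MIN_LIFT_DEFENSIBLE_SWAP   # strictly positive expected lift
        and p_value <= P_VALUE_MAX
        and n >= MIN_N_DEFENSIBLE_SWAP
    )


# Illustrative candidates only; values are made up to exercise each rule.
candidates = [
    {"swap": "swap_a", "lift_pct": 8.0, "p_value": 0.04, "n": 22},
    {"swap": "swap_b", "lift_pct": -3.0, "p_value": 0.01, "n": 40},  # negative lift
    {"swap": "swap_c", "lift_pct": 5.0, "p_value": 0.20, "n": 30},   # p too high
    {"swap": "swap_d", "lift_pct": 6.0, "p_value": 0.08, "n": 10},   # n too small
    {"swap": "swap_e", "lift_pct": 2.5, "p_value": 0.10, "n": 15},   # on the boundary
]
defensible = [
    c["swap"] for c in candidates
    if is_defensible_swap(c["lift_pct"], c["p_value"], c["n"])
]
print(defensible)
```

Keeping the thresholds as module-level constants mirrors the "change a number, regenerate" workflow described above: loosening `P_VALUE_MAX` or `MIN_N_DEFENSIBLE_SWAP` widens the set of swaps without touching the rule itself.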

What the engine does NOT do

No learned weights. Every threshold is named in the script. No hidden model.
No campaign-level forecasting. "Expected D7 lift" is cross-concept evidence, not a prediction of what your specific creative will score.
No automated assignments. Producer routing surfaces track records; the Creative Producer assigns.
No approval authority. Every recommendation can be overridden. Overrides are themselves data — over time we learn which recommendation types the team trusts.

Honest about the limitations

Two known weaknesses worth stating up front:

Top-N counts are operational, not statistical. Picking 5 SCALE / 5 ITERATE / 3 EXPLORE / 3 KILL is sized to roughly match motion-design capacity per biweekly cycle. If capacity changes, change the knob.
The defensible-swap filter is observational. When we say a peripheral swap "lifts D7 by +13.7%," that's the cross-concept correlation — we don't yet control for which concepts the swap was tested on. The biweekly brief design (same swap applied across multiple concepts each cycle) is exactly what generates the controlled data needed to compute true causal lifts in 5+ cycles.

Every cycle's run prints exact selections to stdout: concepts picked, defensible swaps qualified, CV quantiles used. That's the audit trail.

The hypothesis tracker — how the engine learns

The engine's recommendations are inputs to a feedback loop, not endpoints. Each version recommendation gets a unique ID (H-2026-05-05-001, etc.) that the Creative Producer references when shipping the brief. The Producer marks in the tracker which target concept name was assigned to each hypothesis. Two cycles later, when that concept's D7 data has matured, the engine automatically computes predicted-vs-actual and tags the hypothesis as HIT, MISS, or MIXED.
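As an illustration of the `H-YYYY-MM-DD-NNN` format, the sketch below issues the next ID for a cycle date. It assumes the three-digit counter restarts at 001 for each cycle date, which the report does not state; `next_hypothesis_id` and `ID_PATTERN` are hypothetical names.

```python
import re
from datetime import date

ID_PATTERN = re.compile(r"^H-(\d{4}-\d{2}-\d{2})-(\d{3})$")


def next_hypothesis_id(cycle_date, existing_ids):
    """Return the next sequential hypothesis ID for the given cycle date."""
    prefix = cycle_date.isoformat()  # YYYY-MM-DD
    used = []
    for hid in existing_ids:
        match = ID_PATTERN.match(hid)
        if match and match.group(1) == prefix:
            used.append(int(match.group(2)))
    sequence = max(used, default=0) + 1
    return f"H-{prefix}-{sequence:03d}"


first_id = next_hypothesis_id(date(2026, 5, 5), [])
third_id = next_hypothesis_id(
    date(2026, 5, 5),
    ["H-2026-05-05-001", "H-2026-05-05-002", "H-2026-04-21-007"],
)
print(first_id, third_id)
```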

This is what turns a one-shot analysis into a learning system. Over 5–6 cycles we'll know:

Which recommendation types we should trust (EXPLOIT vs EXPLORE hit rates by concept class).
Which "defensible peripheral swaps" actually deliver — and which were observational noise.
Whether the 3 Reliable Winners stay reliable under deliberate adoption.
Which producers genuinely lift specific tags vs which were assigned to better concepts.

The engine's output is designed to improve cycle over cycle because the tracker measures it. Recommendation types that hit get reinforced; those that miss get downweighted or dropped. The tracker is the mechanism that populates the currently empty recommendation_outcomes.csv (see §9.3) and graduates the engine from "best-evidence guidance" toward "evidence-validated guidance." This is the system we are explicitly building toward.

7 · Strategic (Level 4) — Where should we invest?

7.1 Production concentration — OVER vs UNDER-invested tags

For each tag value, the chart shows how its mean Lumina D7 differs from the overall D7 average across the 825 D7-eligible non-benchmark creatives. Red = OVER-INVESTED (high volume + below-average D7 performance — these are the tags consuming most of your production capacity but pulling your average down; the rebalance candidates). Green = WELL-INVESTED. Orange = UNDER-INVESTED (low volume + above-average performance — scale candidates).

The rebalance story: the 5 most over-invested tags are the high-volume staples (strategic_angle: Relax & Escape at 67% of production with -1.2% lift, hook: Surprise / Unexpected at 34% with -2.2%, gameplay_narrative_arc: Order → Chaos at 62% with -2.2%, and so on). Meanwhile the 3 most under-invested tags (Cognitive Challenge, Problem Statement, Core Loop / Real-gameplay) overlap with the DNA Reliable Winners identified in §5.3: Cognitive Challenge and Problem Statement appear in both lists. This is the clearest reallocation signal in the report.

Source: outputs_3/production_concentration_d7.csv — the Lumina D7 version, with benchmark creatives and "No Tag Applied" rows excluded (benchmarks recur across windows and would inflate volume artificially). Built by outputs_3/build_production_concentration_d7.py. NEUTRAL-status tags (medium-volume) are hidden in this chart for clarity; see the CSV for the full list.
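The status labels can be read as a two-way split on production share and lift versus the D7 average. The sketch below is an assumption-heavy illustration: the report does not state the volume cutoffs, so `HIGH_VOLUME_SHARE` and `LOW_VOLUME_SHARE` are hypothetical, as is the reading of WELL-INVESTED as "high volume with non-negative lift". The logic in build_production_concentration_d7.py may differ.

```python
HIGH_VOLUME_SHARE = 0.30  # hypothetical cutoff, not from the report
LOW_VOLUME_SHARE = 0.10   # hypothetical cutoff, not from the report


def concentration_status(volume_share, lift_pct):
    """Classify a tag value by production share and lift vs the D7 average."""
    if volume_share >= HIGH_VOLUME_SHARE:
        # High volume: below-average performance drags the portfolio mean.
        return "OVER-INVESTED" if lift_pct < 0 else "WELL-INVESTED"
    if volume_share <= LOW_VOLUME_SHARE and lift_pct > 0:
        return "UNDER-INVESTED"  # scale candidate
    return "NEUTRAL"


# Shares and lifts for the three staples quoted in the paragraph above.
staples = {
    "strategic_angle: Relax & Escape": (0.67, -1.2),
    "hook: Surprise / Unexpected": (0.34, -2.2),
    "gameplay_narrative_arc: Order → Chaos": (0.62, -2.2),
}
statuses = {tag: concentration_status(*vals) for tag, vals in staples.items()}
print(statuses)
```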

7.2 Underexplored DNA recipes — the EXPLORE frontier

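9-tag recipes flagged is_underexplored=True by the pipeline (the recipe appears far less often than expected) AND with at least one D7-mature creative, sorted by avg Lumina D7. These are direct EXPLORE candidates. The table shows only the 4 DNA tag columns, so rows that look identical (such as the two Relax & Escape / No Hook / Order → Chaos rows) presumably differ in the peripheral tags not shown.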

Strategic Angle	Hook	Creative Focus	Narrative Arc	n	Avg Lumina D7	D7 percentile
Relax & Escape	None (No Hook)	Gameplay Mechanic	Order → Chaos	3	73.0	100%
Destruction & Chaos	Social Proof	Gameplay Mechanic	None (No Arc)	4	66.4	99%
Relax & Escape	None (No Hook)	Gameplay Mechanic	Order → Chaos	4	66.3	97%
Explore/Discovery	Surprise / Unexpected	Juicy Effects	Levels mix	3	61.0	90%
Relax & Escape	Social Proof	Narrative Moment	Steady Win	3	59.2	88%
Progress & Achievement	Problem Statement	Juicy Effects	Fail → Win	3	57.2	86%
Progress & Achievement	Surprise / Unexpected	Gameplay Mechanic	Fail → Win	4	56.7	83%
Progress & Achievement	Unrelated Disruptor	Gameplay Mechanic	Steady Win	3	50.5	72%

Source: underexplored_combinations.csv. Sample sizes are intentionally small (n=3–5) — that's the definition of underexplored.

7.3 Producer overview & specialist routing

Producer	n creatives	Avg Lumina D7	Classification	p-value vs peers
Yevhenii Hrushetskyi	69	48.7	Above Average	0.267
Vira Bilous	25	46.7	Above Average	0.627
Julie Droz	62	48.5	Above Average	0.431
Alexandre de Crozals	13	47.1	Above Average	0.629
Nikola Kachanski	348	46.2	Average	0.224
Jeremy Laplatine	224	42.4	Average	0.071
Ayca Uyanik	24	41.1	Below Average	0.308
Ana Narchemashvili	14	37.5	High Variance / High Ceiling	0.279

Hard Tag Specialists (sorted by lift on hard tags vs peers):

Producer	Specialization	Avg Lumina D7 (hard tags)	Lift vs peers	Sample
Julie Droz	Hard Tag Specialist (limited easy data)	48.4	+8.8%	543
Alexandre de Crozals	Hard Tag Specialist (limited easy data)	47.5	+8.7%	97
Yevhenii Hrushetskyi	Hard Tag Specialist (limited easy data)	48.5	+6.7%	609
Vira Bilous	Insufficient Data	47.0	+4.0%	219
Nikola Kachanski	Insufficient Data	46.2	+1.6%	3105
Jeremy Laplatine	Insufficient Data	42.5	-5.9%	1979
Ayca Uyanik	Insufficient Data	41.2	-7.7%	212
Ana Narchemashvili	Insufficient Data	32.2	-23.0%	94

8 · The Suggestion Engine — Architecture

The engine sits on top of three new tables in outputs_3/:

creative_decision_facts.csv — atomic per-(creative, window) table with the 4-component Lumina D7 decomposition (IPM, ROAS_D7, RET_D7, CPI). Joins concept-class. 1,078 rows.
lift_cube.csv — (tag × KPI × producer) lookup. 523 rows × 29 cols. Pre-computed evidence tier (HIGH/MEDIUM/EXPLORATORY/INSUFFICIENT_DATA).
dna_difficulty_matrix.csv — DNA tag values × difficulty × success rate × specialist routing. 27 rows × 19 cols.

Plus the orchestrator: suggest.py. Run as:

$ python3 outputs_3/suggest.py --creative cmidfpdzz02k5kz0cwmel029n --window 5
$ python3 outputs_3/suggest.py --worst-d7 5 # 5 worst D7 creatives
$ python3 outputs_3/suggest.py --concept R1617 # all creatives in concept

Evidence tiers (used everywhere in the engine)

Tier	Rule	Use for
HIGH_CONFIDENCE	p < 0.05 AND n ≥ 30	Direct recommendations, action plans
MEDIUM_CONFIDENCE	p < 0.10 AND n ≥ 15	Hypothesis generation, A/B tests
EXPLORATORY	weaker	Discussion, qualitative reads
INSUFFICIENT_DATA	missing p-value or n	Skip
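The tier table translates directly into a small classifier. The sketch below assumes the rules are evaluated top-down, so a row that fails HIGH_CONFIDENCE on sample size can still land in MEDIUM_CONFIDENCE; the function name `evidence_tier` is hypothetical and the example inputs are illustrative.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def evidence_tier(p_value, n):
    """Map a (p-value, sample size) pair to the report's evidence tier."""
    if _is_missing(p_value) or _is_missing(n):
        return "INSUFFICIENT_DATA"
    if p_value < 0.05 and n >= 30:
        return "HIGH_CONFIDENCE"
    if p_value < 0.10 and n >= 15:
        return "MEDIUM_CONFIDENCE"
    return "EXPLORATORY"


examples = [(0.009, 117), (0.04, 20), (0.08, 15), (0.03, 10), (None, 50)]
tiers = [evidence_tier(p, n) for p, n in examples]
print(tiers)
```

Note the second example: p = 0.04 would qualify for the top tier on significance alone, but n = 20 holds it at MEDIUM_CONFIDENCE. Both conditions must hold for a row to drive a direct recommendation.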

9 · Data Quality & Limitations

Honesty matters more than confidence. The items below are what this analysis cannot do today and the things in the data that warrant skepticism. Stating them up front is the only way recommendations downstream are trustworthy.

The long-run intent. Each item below is a known limitation, not an accepted compromise. The biweekly cycle is designed to address them over time:

Concept fixed effects become possible once the structured per-cycle peripheral variation generates enough controlled comparisons (see §6.6) — currently planned for cycle 5+.
The empty recommendation_outcomes.csv is exactly what the hypothesis tracker (§6.6) is built to populate. Each cycle's resolved hypotheses feed the loop.
Producer routing causality improves as we deliberately rotate producers across DNA tags via the EXPLOIT/EXPLORE routing options.
Tag confounding declines as we deliberately ship the same peripheral swap on multiple concepts, breaking the concept-tag correlation structure.
Data hygiene issues (typos, missing role labels, untagged creatives) are upstream-fixable and tracked separately.

The framework's value isn't its current accuracy — it's that every cycle measurably narrows the gap.

9.1 What this analysis does NOT claim

Causal effects. All lifts are observational. Concepts that already perform well tend to use specific tags, so the direction of causation is ambiguous.
Concept-controlled tag effects. tag_metric_correlations.csv and marginal_effects.csv do NOT control for concept. With concept explaining 40.5% of variance, this matters.
Producer-specific causal claims. "Nikola lifts Cognitive Challenge by 7.7%" partly reflects which concepts Nikola was assigned, not pure execution skill.
Predictions. R² on the multi-tag OLS model (marginal_effects.csv) is 0.13. This is a guidance system, not a prediction model.

9.2 Sample-size flags by analysis

Analysis	n	Caveat
Variance decomposition	789–916 D7 rows	Healthy — well-powered.
tag_metric_correlations	varies by tag (some <15)	Use `evidence_tier` to filter.
tag_swap_recommendations	2–10 versions per swap	Most CIs span zero; filter `is_high_confidence=1`.
Producer × tag lifts	often n=10–25	Only 8 producers total; peer baseline is small.
Concept fixed effects	not computed	Major upgrade opportunity.
Inheritance learning	14% ancestor coverage	The "Net learning ≈ zero" finding is provisional; judgment is suspended until coverage improves.

9.3 Known data hygiene issues

Typo splits a tag bucket: "Divergent / Fake Gampeplay" (n=2, typo) and "Divergent / Fake Gameplay" (n=46) are the same value with different labels.
Producer role is constant: s_created_by_user_role = "Internal" for all 1,078 rows — no internal/external producer dimension is analyzable.
recommendation_outcomes.csv is empty: the closed-loop tracker exists but has never been populated. Engine cannot self-evaluate.
84 untagged creatives (8.4% of total), including 6 benchmarks. Tag coverage = 91.6%.

9.4 Confounding warnings (Cramér's V)

16 of 36 tag pairs show V > 0.5 (severe multicollinearity). E.g., strategic_angle ↔ gameplay_narrative_arc = 0.55.
4 of 9 tags are confounded with producer at V > 0.5 — tag effects may just reflect producer style.
5 of 9 tags are temporally confounded with test windows (V > 0.3) — usage shifts over time.
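For readers who want to reproduce one of these association scores, Cramér's V for two categorical columns is sqrt(chi² / (n · (min(rows, cols) − 1))). The sketch below is a standard-library implementation of the plain, uncorrected statistic; whether the pipeline applies a bias correction is not stated in this report, and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k == 0:
        return 0.0  # undefined when a variable has a single level
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Toy data: one perfectly aligned pair and one independent pair.
angle = ["relax", "relax", "challenge", "challenge"]
arc_aligned = ["chaos", "chaos", "win", "win"]
arc_independent = ["chaos", "win", "chaos", "win"]
v_aligned = cramers_v(angle, arc_aligned)
v_independent = cramers_v(angle, arc_independent)
print(v_aligned, v_independent)
```

V runs from 0 (no association) to 1 (one tag fully determines the other), so the V > 0.5 pairs flagged above are tags whose marginal lifts cannot be cleanly separated.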

10 · Roadmap & Open Questions

10.1 Highest-leverage upgrades

Add concept fixed effects to tag_metric_correlations.csv and marginal_effects.csv. This will likely shrink most "tag lift" estimates and surface the few that survive concept control; those are the ones worth acting on.
Close the recommendation_outcomes loop. Track which suggestions producers acted on and what happened. Without this we can't measure engine quality.
Increase ancestor coverage for inheritance_learning_analysis. Currently 14% — push to ≥50% to make iteration claims trustworthy.
Build per-creative explainability beyond the 4-component decomposition. Currently we say "IPM is low"; we don't say "IPM is low because the hook plays at 1.4 sec instead of 0.8."
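To make the first upgrade concrete, the sketch below contrasts a naive cross-concept tag lift with a within-concept (fixed-effects) estimate obtained by demeaning both the tag indicator and the score inside each concept. It is a toy illustration under stated assumptions: the data are invented, the effect is expressed in Lumina points rather than percent, and the function names are hypothetical. It is not the planned implementation.

```python
from collections import defaultdict


def naive_lift(rows):
    """Mean score with the tag minus mean score without it, ignoring concept."""
    on = [y for _, x, y in rows if x == 1]
    off = [y for _, x, y in rows if x == 0]
    return sum(on) / len(on) - sum(off) / len(off)


def within_concept_lift(rows):
    """Fixed-effects slope: demean tag and score within each concept."""
    by_concept = defaultdict(list)
    for concept, x, y in rows:
        by_concept[concept].append((x, y))
    num = den = 0.0
    for pairs in by_concept.values():
        x_bar = sum(x for x, _ in pairs) / len(pairs)
        y_bar = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    if den == 0:
        raise ValueError("tag never varies within any concept")
    return num / den


# Invented rows: (concept, has_tag, lumina_d7). Concept A is strong and
# mostly uses the tag; concept B is weak and rarely uses it.
toy = [
    ("A", 1, 61.0), ("A", 1, 61.0), ("A", 1, 61.0), ("A", 0, 60.0),
    ("B", 0, 40.0), ("B", 0, 40.0), ("B", 0, 40.0), ("B", 1, 41.0),
]
naive = naive_lift(toy)
controlled = within_concept_lift(toy)
print(naive, controlled)
```

In this toy set the naive estimate is 11 points while the within-concept estimate is 1 point: most of the apparent lift is the strong concept, not the tag. That is the pattern the upgrade is meant to detect in the real tables.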

10.2 Open questions for the team

Are CONSISTENT WINNERS being deliberately under-varied? Most have no within-concept swap evidence — they're not teaching us anything.
What fraction of "underexplored" recipes are recipes the team intentionally avoids (brand reasons)? Without this filter, the EXPLORE list overstates opportunity.

11 · Appendix — CSV Reference

Every file used in this report, where it lives, and which level uses it:

File	Path	Level	Purpose
`creative_level_analysis.csv`	`outputs_test_v2/`	L1, L3	Atomic creative-window facts. Source of truth for Lumina D7 + components.
`creative_decision_facts.csv`	`outputs_3/`	L3 (engine)	Per-creative Lumina decomposition + concept class join.
`lift_cube.csv`	`outputs_3/`	L3 (engine)	(tag × KPI × producer) → lift lookup with evidence tier.
`concept_dna_ranking.csv`	`outputs_3/`	L2	Mutual-info ranking of which tag categories define concept identity.
`dna_difficulty_matrix.csv`	`outputs_3/`	L2, L3	DNA tag values × difficulty × success rate × specialist routing.
`variance_breakdown_summary.csv`	`outputs_test_v2/tag_combinations/`	L2	What % of D7 variance each layer explains.
`variance_decomposition.json`	`outputs_test_v2/tag_combinations/producer_analytics/`	L2	ANOVA eta² for concept / producer / strategic_angle.
`concept_level_analysis.csv`	`outputs_test_v2/tag_combinations/`	L2	Concept-class taxonomy (CONSISTENT WINNER, HIGH-CEILING BET...)
`tag_metric_correlations.csv`	`outputs_test_v2/tag_combinations/`	L1, L3	(tag × KPI) lift table with p-values, used for chart bars.
`metric_drivers.csv`	`outputs_test_v2/tag_combinations/`	L3	Top-5 tag drivers per metric (IPM, ROAS, retention...).
`tag_execution_difficulty.csv`	`outputs_test_v2/tag_combinations/`	L2	Per-tag difficulty tier + success rate + ceiling/floor.
`tag_swap_recommendations.csv`	`outputs_test_v2/tag_combinations/`	L3	Within-concept tag swaps with 90% CIs.
`underexplored_combinations.csv`	`outputs_test_v2/tag_combinations/`	L4	9-tag recipes with low n + high Lumina percentile.
`production_concentration_d7.csv`	`outputs_3/`	L4	Lumina D7 version, benchmarks excluded. Replaces the upstream D0 file.
`production_concentration.csv`	`outputs_test_v2/tag_combinations/`	L4 (legacy)	Original D0 version — superseded for §7.1 by the D7 file above.
`weekly_lumina_d7_trajectory.csv`	`outputs_test_v2/tag_combinations/`	L1, L4	Portfolio-wide Lumina trend, WoW change.
`producer_overview.csv`	`outputs_test_v2/tag_combinations/producer_analytics/`	L2, L4	Producer composite_score + p_value vs peers.
`producer_difficulty_analysis.csv`	`outputs_test_v2/tag_combinations/producer_analytics/`	L3, L4	Hard Tag Specialists (Julie, Alexandre, Yevhenii).
`insights_tag_*.csv`	`outputs_test_v2/`	L1	Per-tag-category aggregates (avg_ipm, avg_lumina_d7, n).

Creative Intelligence — Creative Learnings & Production Iteration (tag system). Analyst Edition.
Designed and built by Andres Mendoza · Growth Data Analyst, Central Hybrid Team · Creative analytics system.
Built from outputs_test_v2/ + outputs_3/. All claims reproducible from CSVs. All limitations stated in §9.
Next iterations: add concept fixed effects, close recommendation_outcomes loop, increase ancestor coverage.