# Scoring Calibration — Findings & Decisions

## Current Scoring Architecture

The pipeline uses a two-pass approach:

1. **Programmatic scoring** (`scoring.js` → `programmatic-scorer.js`): runs locally on the HTML DOM, no API cost. Sets status → `semantic_scored` (vision off) or `prog_scored` (vision on).
2. **Semantic re-scoring** (`score_semantic` orchestrator batch): Sonnet LLM re-scores `headline_quality`, `value_proposition`, and `unique_selling_proposition` on top of the programmatic base, overwriting those three factor scores in `score_json`.

`score_sites` (full LLM scoring via `ENABLE_LLM_SCORING=true + ENABLE_VISION=false`) is a third mode that bypasses programmatic scoring entirely and scores all factors via Claude Max (`claude -p` batch). It is currently disabled.

---

## Calibration Run — 2026-03-18

**Script:** `scripts/calibrate-scorer.js`
**Sample:** n=495 English-language sites with `score IS NOT NULL AND html_dom IS NOT NULL`

### Overall Metrics

| Metric                  | Value        | Target |
| ----------------------- | ------------ | ------ |
| R²                      | **−1.21**    | ≥ 0.75 |
| Pearson r               | 0.215        | —      |
| MAE                     | **11.5 pts** | —      |
| LLM mean score          | 59.6         | —      |
| Programmatic mean score | 55.5         | —      |
| Mean diff (prog − LLM)  | −4.2 pts     | —      |

**Result: FAIL.** The programmatic scorer diverges significantly from LLM ground truth. An R² of −1.21 means the programmatic scores are _worse than predicting the mean for every site_, and the correlation is weak (r = 0.215).

### Per-Factor Divergence (sorted by MAE)

| Factor                       | Mean Diff (prog − LLM) | MAE      | Notes                                                          |
| ---------------------------- | ---------------------- | -------- | -------------------------------------------------------------- |
| `headline_quality`           | −2.43                  | **2.74** | Worst factor. Prog under-scores — misses semantic quality of headlines |
| `offer_clarity`              | +1.07                  | 1.97     | Prog over-scores                                                |
| `call_to_action`             | −1.34                  | 1.95     | Prog under-scores                                               |
| `trust_signals`              | −0.28                  | 1.93     | Close to neutral but variable                                   |
| `unique_selling_proposition` | +1.18                  | 1.80     | Prog over-scores                                                |
| `value_proposition`          | +0.34                  | 1.69     | Near neutral                                                    |
| `urgency_messaging`          | −0.69                  | 1.49     | Slight under-score                                              |
| `imagery_design`             | +0.84                  | 1.46     | Slight over-score                                               |
| `hook_engagement`            | +0.30                  | 1.21     | Near neutral                                                    |
| `contextual_appropriateness` | +0.34                  | 1.17     | Best factor — programmatic handles this well                    |

### Score Distribution

| Grade     | LLM       | Programmatic |
| --------- | --------- | ------------ |
| F (0–59)  | 246 (50%) | 318 (64%)    |
| D (60–69) | 229 (46%) | 119 (24%)    |
| C (70–79) | 20 (4%)   | 52 (11%)     |
| B (80–89) | 0         | 6 (1%)       |
| A (90+)   | 0         | 0            |

Programmatic pushes more sites into F and C — its distribution is bimodal, versus the LLM's tighter cluster in the D range.

### Worst Outliers

| Domain                  | LLM Score | Prog Score | Diff               |
| ----------------------- | --------- | ---------- | ------------------ |
| live2500kelly.com       | 11.1      | 52.3       | +41.2 (prog over)  |
| therenoguys.ca          | 67.8      | 28.8       | −39.0 (prog under) |
| wineparis.com           | 58.8      | 22.8       | −36.0 (prog under) |
| sealcoatsolutionsma.com | 49.0      | 85.0       | +36.0 (prog over)  |
| abpainting.contractors  | 42.7      | 78.5       | +35.8 (prog over)  |

Programmatic over-scores structurally clean but semantically thin sites (e.g. sparse contractor pages with a phone number and nav). It under-scores sites with rich copy that doesn't follow predictable patterns (e.g. wine merchants, reno companies with long-form text).
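For reference, the headline metrics above can be reproduced from paired score arrays. A minimal sketch — the toy data and helper names are illustrative, not the actual `calibrate-scorer.js` code:

```javascript
// Calibration metrics over paired (LLM, programmatic) scores.
// Toy data for illustration — not the real n=495 sample.
const llm  = [60, 55, 70, 45, 62];
const prog = [52, 68, 40, 75, 50];

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Mean absolute error: average per-site gap between the two scorers.
const mae = (a, b) => mean(a.map((x, i) => Math.abs(x - b[i])));

// R² treats the programmatic score as a prediction of the LLM score.
// It goes negative when predictions are worse than always guessing the mean.
function rSquared(actual, predicted) {
  const m = mean(actual);
  const ssRes = actual.reduce((s, y, i) => s + (y - predicted[i]) ** 2, 0);
  const ssTot = actual.reduce((s, y) => s + (y - m) ** 2, 0);
  return 1 - ssRes / ssTot;
}

console.log(mae(llm, prog).toFixed(1));      // "18.6"
console.log(rSquared(llm, prog).toFixed(2)); // "-5.38" — worse than the mean
```

This is why R² can fall below zero despite a positive Pearson r: the two scorers rank sites somewhat similarly while the actual predicted values miss badly.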
---

## Decision: Keep LLM Scoring

**Programmatic scoring is not a viable replacement for LLM scoring.** The R² of −1.21 and MAE of 11.5 points mean:

- Sites near the `LOW_SCORE_CUTOFF=82` boundary would frequently be misclassified (sent proposals when they shouldn't be, or skipped when they should get proposals)
- `headline_quality` — the factor most important for proposal copy — is the worst-performing factor programmatically

**What programmatic scoring is still good for:**

- Fast first-pass filtering of obviously broken/error pages (`is_error_page`, `is_broken_site`)
- Country detection and contact extraction (unrelated to scoring quality)
- `contextual_appropriateness` (MAE = 1.17) — structural signals work here

**Current mode (confirmed correct):**

- `ENABLE_VISION=false` → the scoring stage routes `assets_captured` → `semantic_scored` via the programmatic scorer
- The `score_semantic` orchestrator batch overwrites `headline_quality`, `value_proposition`, and `unique_selling_proposition` with Sonnet LLM scores
- `ENABLE_LLM_SCORING` is not set in `.env` (it defaults to `true` per `.env.example`), but the `score_sites` batch is disabled because `ENABLE_VISION` is not `false` in its guard logic — **this means full LLM scoring via `score_sites` is not running**

**The semantic re-scoring pass (`score_semantic`) is essential** — it corrects the three worst programmatic factors, and the pipeline must keep it running. It was incorrectly blocked by Gate 2; this has been fixed (2026-03-18, commit e77cf5ea).

---

## Gate 2 Fix (2026-03-18)

`score_semantic` was incorrectly included in Gate 2 (the `proposals_drafted` backlog gate), causing it to be permanently blocked whenever `proposals_drafted > 45` — which is always the case with 5,600+ drafted proposals.

**Fix:** Removed `score_semantic` from Gate 2. It is now subject only to Gate 3 (enriched backlog).
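A hypothetical sketch of the gate check after the fix — the batch names match this doc, but the gate structure, the Gate 3 threshold, and the function names are assumptions, not the orchestrator's actual code:

```javascript
// Hypothetical orchestrator gating. Gate 2 throttles on the proposals_drafted
// backlog; Gate 3 throttles on the enriched backlog. score_semantic is now
// exempt from Gate 2 (the fix described above).
const GATE2_BATCHES = new Set(['draft_proposals']); // score_semantic removed
const GATE3_BATCHES = new Set(['score_semantic', 'draft_proposals']);

function isBlocked(batch, counts) {
  if (GATE2_BATCHES.has(batch) && counts.proposals_drafted > 45) return true;
  if (GATE3_BATCHES.has(batch) && counts.enriched > 200) return true; // 200 is illustrative
  return false;
}

// With 5,600+ drafted proposals, score_semantic no longer trips Gate 2:
isBlocked('score_semantic', { proposals_drafted: 5600, enriched: 0 }); // → false
isBlocked('draft_proposals', { proposals_drafted: 5600, enriched: 0 }); // → true
```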
This unblocks semantic re-scoring of the 25,449 `semantic_scored` sites.

Batch size also raised from 20 → 50 to clear the backlog faster.
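The overwrite behaviour of the re-scoring pass can be sketched as a shallow merge of the three semantic factors into the programmatic `score_json`. The factor names come from this doc; the merge helper and its signature are illustrative:

```javascript
// Only these three factors are re-scored by the Sonnet pass; all other
// programmatic factor scores are left untouched.
const SEMANTIC_FACTORS = [
  'headline_quality',
  'value_proposition',
  'unique_selling_proposition',
];

function applySemanticScores(scoreJson, llmScores) {
  const merged = { ...scoreJson };
  for (const factor of SEMANTIC_FACTORS) {
    if (llmScores[factor] !== undefined) merged[factor] = llmScores[factor];
  }
  return merged;
}

const merged = applySemanticScores(
  { headline_quality: 4.0, offer_clarity: 6.2, value_proposition: 5.1 },
  { headline_quality: 6.4, value_proposition: 4.8, unique_selling_proposition: 5.5 }
);
// merged.offer_clarity stays 6.2; the three semantic factors take the LLM values
```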