# Scoring Calibration — Findings & Decisions

## Current Scoring Architecture

The pipeline uses a two-pass approach:

1. **Programmatic scoring** (`scoring.js` → `programmatic-scorer.js`): runs locally on HTML DOM, no API cost. Sets status → `semantic_scored` (vision off) or `prog_scored` (vision on).
2. **Semantic re-scoring** (`score_semantic` orchestrator batch): Sonnet LLM re-scores `headline_quality`, `value_proposition`, `unique_selling_proposition` on top of the programmatic base. Overwrites those three factor scores in `score_json`.

`score_sites` (full LLM scoring via `ENABLE_LLM_SCORING=true + ENABLE_VISION=false`) is a third mode that bypasses programmatic scoring entirely and scores all factors via Claude Max (`claude -p` batch). It is currently disabled.
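
For reference, a minimal sketch of how the second pass layers onto the first, assuming (hypothetically) that `score_json` keeps per-factor scores under a `factors` key; the helper name is illustrative, not the actual `scoring.js` API:

```js
// Hypothetical sketch: layer the score_semantic pass on top of the
// programmatic base. Only the three semantic factors are overwritten;
// everything else in score_json stays as the programmatic scorer wrote it.
const SEMANTIC_FACTORS = [
  'headline_quality',
  'value_proposition',
  'unique_selling_proposition',
];

function applySemanticRescore(scoreJson, semanticScores) {
  const merged = { ...scoreJson, factors: { ...scoreJson.factors } };
  for (const factor of SEMANTIC_FACTORS) {
    if (semanticScores[factor] != null) {
      merged.factors[factor] = semanticScores[factor]; // overwrite the prog score
    }
  }
  return merged;
}
```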
 11  
 12  ---
 13  
 14  ## Calibration Run — 2026-03-18
 15  
 16  **Script:** `scripts/calibrate-scorer.js`
 17  **Sample:** n=495 English-language sites with `score IS NOT NULL AND html_dom IS NOT NULL`
 18  
 19  ### Overall Metrics
 20  
 21  | Metric                  | Value        | Target |
 22  | ----------------------- | ------------ | ------ |
 23  | R²                      | **-1.21**    | ≥ 0.75 |
 24  | Pearson r               | 0.215        | —      |
 25  | MAE                     | **11.5 pts** | —      |
 26  | LLM mean score          | 59.6         | —      |
 27  | Programmatic mean score | 55.5         | —      |
 28  | Mean diff (prog − LLM)  | −4.2 pts     | —      |
 29  
 30  **Result: FAIL.** Programmatic scorer diverges significantly from LLM ground truth. R² of −1.21 means programmatic scores are _worse than predicting the mean for every site_. The correlation is weak (r=0.215).
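
These metrics follow the standard definitions, with LLM scores treated as ground truth. A sketch (not the `calibrate-scorer.js` implementation) showing why R² can go negative:

```js
// llm = ground-truth scores, prog = programmatic predictions (same length).
function calibrationMetrics(llm, prog) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const yBar = mean(llm);
  const pBar = mean(prog);

  // Mean absolute error between the two scorers.
  const mae = mean(llm.map((y, i) => Math.abs(prog[i] - y)));

  // R² = 1 − SS_res / SS_tot; negative when the predictions have larger
  // residuals than simply predicting the LLM mean for every site.
  const ssRes = llm.reduce((s, y, i) => s + (y - prog[i]) ** 2, 0);
  const ssTot = llm.reduce((s, y) => s + (y - yBar) ** 2, 0);
  const r2 = 1 - ssRes / ssTot;

  // Pearson correlation between the two score series.
  const cov = llm.reduce((s, y, i) => s + (y - yBar) * (prog[i] - pBar), 0);
  const sdY = Math.sqrt(ssTot);
  const sdP = Math.sqrt(prog.reduce((s, p) => s + (p - pBar) ** 2, 0));
  const pearson = cov / (sdY * sdP);

  return { r2, mae, pearson, meanDiff: pBar - yBar };
}
```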
 31  
 32  ### Per-Factor Divergence (sorted by MAE)
 33  
 34  | Factor                       | Mean Diff (prog − LLM) | MAE      | Notes                                                                  |
 35  | ---------------------------- | ---------------------- | -------- | ---------------------------------------------------------------------- |
 36  | `headline_quality`           | −2.43                  | **2.74** | Worst factor. Prog under-scores — misses semantic quality of headlines |
 37  | `offer_clarity`              | +1.07                  | 1.97     | Prog over-scores                                                       |
 38  | `call_to_action`             | −1.34                  | 1.95     | Prog under-scores                                                      |
 39  | `trust_signals`              | −0.28                  | 1.93     | Close to neutral but variable                                          |
 40  | `unique_selling_proposition` | +1.18                  | 1.80     | Prog over-scores                                                       |
 41  | `value_proposition`          | +0.34                  | 1.69     | Near neutral                                                           |
 42  | `urgency_messaging`          | −0.69                  | 1.49     | Slight under-score                                                     |
 43  | `imagery_design`             | +0.84                  | 1.46     | Slight over-score                                                      |
 44  | `hook_engagement`            | +0.30                  | 1.21     | Near neutral                                                           |
 45  | `contextual_appropriateness` | +0.34                  | 1.17     | Best factor — programmatic handles this well                           |
 46  
 47  ### Score Distribution
 48  
 49  | Grade     | LLM       | Programmatic |
 50  | --------- | --------- | ------------ |
 51  | F (0–59)  | 246 (50%) | 318 (64%)    |
 52  | D (60–69) | 229 (46%) | 119 (24%)    |
 53  | C (70–79) | 20 (4%)   | 52 (11%)     |
 54  | B (80–89) | 0         | 6 (1%)       |
 55  | A (90+)   | 0         | 0            |
 56  
Programmatic pushes more sites into F and C: its distribution is more bimodal, whereas the LLM clusters tightly in the F/D range.
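
For reference, the grade buckets above map straight off the 0–100 score; a trivial sketch (function names illustrative):

```js
// Grade buckets as used in the distribution table above.
function gradeOf(score) {
  if (score >= 90) return 'A';
  if (score >= 80) return 'B';
  if (score >= 70) return 'C';
  if (score >= 60) return 'D';
  return 'F';
}

// Tally a list of scores into the table's grade counts.
const gradeDistribution = (scores) =>
  scores.reduce((acc, s) => {
    const g = gradeOf(s);
    acc[g] = (acc[g] || 0) + 1;
    return acc;
  }, {});
```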

### Worst Outliers

| Domain                  | LLM Score | Prog Score | Diff               |
| ----------------------- | --------- | ---------- | ------------------ |
| live2500kelly.com       | 11.1      | 52.3       | +41.2 (prog over)  |
| therenoguys.ca          | 67.8      | 28.8       | −39.0 (prog under) |
| wineparis.com           | 58.8      | 22.8       | −36.0 (prog under) |
| sealcoatsolutionsma.com | 49.0      | 85.0       | +36.0 (prog over)  |
| abpainting.contractors  | 42.7      | 78.5       | +35.8 (prog over)  |

Programmatic over-scores structurally clean but semantically thin sites (e.g. sparse contractor pages with a phone number and nav). It under-scores sites with rich copy that doesn't follow predictable patterns (e.g. wine merchants, reno companies with long-form text).

---

## Decision: Keep LLM Scoring

**Programmatic scoring is not a viable replacement for LLM scoring.** An R² of −1.21 and an MAE of 11.5 points mean:

- Sites near the `LOW_SCORE_CUTOFF=82` boundary would frequently be misclassified (sent proposals when they shouldn't get one, or skipped when they should), as illustrated below
- `headline_quality` — the factor most important for proposal copy — is the worst-performing factor programmatically
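
To make the boundary risk concrete, a hypothetical sketch assuming the pipeline treats scores below `LOW_SCORE_CUTOFF` as proposal candidates (that direction is an assumption); the sealcoatsolutionsma.com outlier above lands on opposite sides of the cutoff depending on which scorer you trust:

```js
// Assumption: score < LOW_SCORE_CUTOFF means "send a proposal".
const LOW_SCORE_CUTOFF = 82;
const qualifiesForProposal = (score) => score < LOW_SCORE_CUTOFF;

// With MAE ~11.5, the two scorers routinely disagree near the boundary.
// Example from the outlier table (sealcoatsolutionsma.com):
qualifiesForProposal(49.0); // true  — LLM score: would get a proposal
qualifiesForProposal(85.0); // false — programmatic score: would be skipped
```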

**What programmatic scoring is still good for:**

- Fast first-pass filtering of obviously broken/error pages (`is_error_page`, `is_broken_site`) — see the sketch after this list
- Country detection and contact extraction (unrelated to scoring quality)
- `contextual_appropriateness` (MAE=1.17) — structural signals work here
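
A minimal sketch of that first-pass role, using the flags named above (the filter function itself is illustrative, not the actual pipeline code):

```js
// Use the programmatic pass only for cheap gating, not for the score itself.
// is_error_page / is_broken_site are flags produced by programmatic-scorer.js.
function passesFirstPassFilter(progResult) {
  if (progResult.is_error_page) return false;  // error/placeholder pages
  if (progResult.is_broken_site) return false; // unrenderable or empty DOM
  return true; // everything else moves on to LLM / semantic scoring
}
```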

**Current mode (confirmed correct):**

- `ENABLE_VISION=false` → scoring stage routes `assets_captured` → `semantic_scored` via programmatic scorer (see the routing sketch after this list)
- `score_semantic` orchestrator batch overwrites `headline_quality`, `value_proposition`, `unique_selling_proposition` with Sonnet LLM scores
- `ENABLE_LLM_SCORING` is not set in `.env` (defaults to `true` per `.env.example`), but the `score_sites` batch is disabled because `ENABLE_VISION` is not `false` in its guard logic — **this means full LLM scoring via score_sites is not running**
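
The routing in the list above, summarized as a sketch (the string comparison and function name are assumptions, not the actual guard code):

```js
// Status routing in the scoring stage under the current configuration.
// ENABLE_VISION=false → programmatic scorer sets semantic_scored directly;
// ENABLE_VISION=true  → sets prog_scored (a vision pass would follow).
function nextStatusAfterProgrammaticScoring(env) {
  return env.ENABLE_VISION === 'true' ? 'prog_scored' : 'semantic_scored';
}

// The score_semantic batch then picks up semantic_scored rows and
// overwrites the three semantic factors with Sonnet scores.
```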

**The semantic re-scoring pass (`score_semantic`) is essential**: it corrects `headline_quality` (the worst programmatic factor by MAE) along with `value_proposition` and `unique_selling_proposition`. The pipeline must keep it running. It was incorrectly blocked by Gate 2; this has been fixed (2026-03-18, commit e77cf5ea).

---

## Gate 2 Fix (2026-03-18)

`score_semantic` was incorrectly included in Gate 2 (proposals_drafted backlog gate), causing it to be permanently blocked when `proposals_drafted > 45` (which is always the case with 5,600+ drafted proposals).

**Fix:** Removed `score_semantic` from Gate 2. It is now only subject to Gate 3 (enriched backlog). This unblocks semantic re-scoring of the 25,449 `semantic_scored` sites.

Batch size also raised from 20 → 50 to clear the backlog faster.
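
A sketch of the resulting orchestrator config delta (keys and names are hypothetical, drawn only from the values mentioned above):

```js
// Illustrative shape only — not the real orchestrator config keys.
const batches = {
  score_semantic: {
    batchSize: 50,    // raised from 20 to clear the 25,449-site backlog faster
    gates: ['gate3'], // was ['gate2', 'gate3'] before the 2026-03-18 fix
  },
};
```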