# Final Product Improvements — Sprint Plan & Architecture

## Infrastructure Reality Check

Before planning, here's what already exists that matters:

| Asset | Status | Implication |
|---|---|---|
| `analyses` table (full JSONB) | ✅ Supabase | Analyses are already stored — feedback loop is feasible |
| `feedback` table (`analysis_id` FK) | ✅ Supabase | Can join feedback to output for retraining |
| `lib/knowledge/` (salary, skills, linkedin playbook, cv-practices) | ✅ Exists | Golden examples slot naturally alongside this |
| A/B testing | ❌ Nothing | Build from scratch |
| Golden examples | ❌ Nothing | Build from scratch |
| Quality CI | ✅ GitHub Actions (lint/type/build only) | Extend, don't rebuild |
| Feedback `selectedIssues` | ✅ Stored | Already collecting structured issue types |

---

## Feature-by-Feature Architecture Analysis

---

### Feature 1: Golden Examples

**What "golden examples" actually do in an LLM pipeline:**
They are few-shot examples injected into system prompts. They work because Claude pattern-matches quality — showing it two excellent cover letters makes it write closer to that standard than any instruction can. This is the highest-ROI change you can make to output quality, and the effect is measurable.

**Scope: 3 types, 5 each:**
- **LinkedIn**: About section, Headline, Summary-of-skills block
- **CV**: Profile summary + 3 quantified experience bullets per example
- **Cover letter**: Full 3-paragraph letter (matches the ≤3 paragraph rule)

**File structure:**
```
lib/golden-examples/
  linkedin-about.ts     # 5 examples with metadata
  linkedin-headline.ts  # 5 examples
  cv-profiles.ts        # 5 CV summaries + bullet sets
  cover-letters.ts      # 5 cover letters
  select.ts             # Role/industry keyword matching (no embeddings needed yet)
```

**The selection logic:** simple keyword matching on `targetRole` + `industry`. Pick 1–2 examples per prompt injection. Embeddings/semantic search is a later optimization — keyword matching covers 80% of the value at 5% of the cost. (A sketch of the matcher appears at the end of this feature.)

**Format of each example:**
```typescript
interface GoldenExample {
  id: string;
  role: string;        // "Product Manager"
  industry: string;    // "tech" | "finance" | "healthcare" | "general"
  seniority: string;   // "senior" | "mid" | "junior" | "any"
  content: string;     // The actual text
  whyItWorks: string;  // 1 sentence — used in judge prompts + system prompt commentary
  keywords: string[];  // For matching
}
```

**Integration points:**
1. `lib/prompts/cover-letter.ts` — inject 1 matching example with `[WHY IT WORKS: ...]` annotation
2. LinkedIn plan generation — inject 1 About + 1 Headline example
3. `lib/prompts/gap-analysis.ts` CV suggestions block — inject 1 CV bullets example

**Content curation note:** This is the hardest part and the most important. Bad golden examples actively hurt output quality. Each example needs to pass a human review against a checklist: quantified impact, no buzzwords, industry-specific language, correct length, no AI-sounding phrases.
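
Here is a minimal sketch of what `select.ts` could look like, assuming the `GoldenExample` interface above is exported from a shared types module. The scoring weights and the `SelectInput` shape are illustrative assumptions, not tuned values:

```typescript
// lib/golden-examples/select.ts (sketch; weights are illustrative guesses)
import type { GoldenExample } from "./types"; // assumed home of the interface above

export interface SelectInput {
  targetRole: string; // e.g. "Senior Product Manager"
  industry: string;   // same vocabulary as GoldenExample.industry
  seniority?: string;
}

// Score one example against the target: an industry match outweighs keyword
// hits, and "general" examples act as a weak fallback.
function score(example: GoldenExample, input: SelectInput): number {
  const role = input.targetRole.toLowerCase();
  let s = 0;
  for (const kw of example.keywords) {
    if (role.includes(kw.toLowerCase())) s += 2;
  }
  if (example.industry === input.industry) s += 3;
  else if (example.industry === "general") s += 1;
  if (input.seniority && (example.seniority === input.seniority || example.seniority === "any")) {
    s += 1;
  }
  return s;
}

// Pick the top 1–2 examples for injection. Zero-score examples are dropped
// rather than padded, so a weak match injects one example instead of two.
export function selectExamples(
  pool: GoldenExample[],
  input: SelectInput,
  max: 1 | 2 = 2,
): GoldenExample[] {
  return pool
    .map((ex) => ({ ex, s: score(ex, input) }))
    .filter(({ s }) => s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, max)
    .map(({ ex }) => ex);
}
```

Because the scorer is a pure function, it can be unit-tested in the normal PR-time CI without touching the Claude API.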

---

### Feature 2: A/B Testing

**Two levels, built independently:**

**Level A — Prompt A/B (what matters most):**
Test hypothesis: *"Golden example injection improves output quality as measured by positive feedback rate."*

Architecture:
```
New Supabase table: experiments
  experiment_id  TEXT,
  user_id_hash   TEXT,     -- SHA-256(user_id + experiment_id), never raw user_id
  variant        TEXT,     -- 'control' | 'treatment'
  analysis_id    UUID FK,
  created_at
```

Variant assignment: deterministic hash — `sha256(userId + experimentId) % 2`. The same user always gets the same variant, stable across sessions, with no cookies needed server-side. (A sketch of `assignVariant()` follows at the end of this feature.)

Tracking: add `metadata.experimentVariant` to the SSE `complete` event and to the stored `analyses` row.

Metrics query (raw Supabase SQL):
```sql
SELECT
  e.variant,
  COUNT(*) AS analyses,
  AVG(f.rating::int) AS positive_rate
FROM experiments e
JOIN feedback f ON e.analysis_id = f.analysis_id
GROUP BY e.variant;
```

**Level B — UI A/B (secondary, lower priority):**
Next.js middleware sets a cookie on first visit; Vercel Analytics tracks page events per variant. Useful for landing page headline tests and CTA copy. Low engineering effort once Level A is built.

**What to test first:** The single most valuable first experiment is **golden examples injection vs. control** (Feature 1 vs. baseline). This ties Features 1 and 2 together in the same sprint.
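
The assignment function (task 2.2 calls it `assignVariant()`) is small enough to sketch in full. The `sha256(userId + experimentId) % 2` scheme is the one described above; reading the first four digest bytes as the integer is an assumed implementation detail:

```typescript
// lib/ab-testing.ts (sketch)
import { createHash } from "node:crypto";

export type Variant = "control" | "treatment";

// Deterministic assignment: sha256(userId + experimentId) % 2. The same
// inputs always hash to the same bucket, so no session state is needed.
export function assignVariant(userId: string, experimentId: string): Variant {
  const digest = createHash("sha256").update(userId + experimentId).digest();
  // Reduce the digest to an integer via its first four bytes; parity buckets it.
  return digest.readUInt32BE(0) % 2 === 0 ? "control" : "treatment";
}

// The same hash doubles as the privacy-safe identifier stored in
// experiments.user_id_hash, so the raw user_id never leaves the handler.
export function userIdHash(userId: string, experimentId: string): string {
  return createHash("sha256").update(userId + experimentId).digest("hex");
}
```

Hashing `userId + experimentId` rather than `userId` alone also re-randomizes assignment per experiment, so the same users are not always the treatment group.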
"Retraining" here means **prompt curation** — high-rated real-user outputs get promoted to golden examples, which get injected into future prompts. This is actually better than fine-tuning for this use case because: 146 - It is interpretable (you can read what is being promoted) 147 - It is controllable (human approval gate) 148 - It is fast (no training cycle) 149 - It compounds: each month's golden examples are better than last month's 150 151 **The full data pipeline already partially exists:** 152 - `analyses` table stores the full JSONB output ✓ 153 - `feedback` table has `analysis_id` FK ✓ 154 - `feedback` stores `selected_issues` ✓ 155 156 **What's missing:** 157 1. The export + curation script 158 2. The human review step 159 3. The promotion script that updates golden example files 160 161 **Monthly loop:** 162 ``` 163 Week 1 of month: 164 1. Run: node scripts/export-golden-candidates.cjs 165 → Queries: analyses JOIN feedback WHERE rating=true AND selected_issues='[]' 166 → Groups by section type (cover_letter, linkedin_about, etc.) 167 → Exports top 20 candidates per section to review/YYYY-MM/candidates.md 168 169 2. Human reviews candidates.md (15–30 min) 170 → Annotates: APPROVE / REJECT / NEEDS_EDIT 171 172 3. Run: node scripts/promote-to-golden.cjs --month=YYYY-MM 173 → Reads approved candidates 174 → Updates lib/golden-examples/*.ts 175 → Creates PR for review 176 177 4. Merge PR → triggers quality test run → confirm no regressions 178 ``` 179 180 **Privacy note:** Full analysis results are already stored. Before promoting user outputs to golden examples, verify the Privacy Policy covers internal quality improvement. Add a clause if not already present. 181 182 --- 183 184 ## Sprint Plan 185 186 **Total: 4 sprints, ~1 week each** 187 188 --- 189 190 ### Sprint 1 — Golden Examples Foundation 191 192 **Goal:** Build the library, inject into prompts, deploy. Immediate quality improvement with no new infrastructure. 193 194 | # | Task | Effort | 195 |---|---|---| 196 | 1.1 | Create `lib/golden-examples/` structure and TypeScript types | S | 197 | 1.2 | Write + curate 5 LinkedIn About examples (tech, finance, mid-career, senior, career changer) | M | 198 | 1.3 | Write + curate 5 LinkedIn Headline examples | S | 199 | 1.4 | Write + curate 5 CV profile summaries + bullet sets | M | 200 | 1.5 | Write + curate 5 cover letters (aligned with 3-para rule + no AI chars) | M | 201 | 1.6 | Build `select.ts` — keyword-based 1–2 example picker | S | 202 | 1.7 | Inject into `cover-letter.ts` prompt | S | 203 | 1.8 | Inject into LinkedIn plan generation | S | 204 | 1.9 | Inject into CV suggestions block | S | 205 | 1.10 | Manual QA: run 3 test analyses, verify richer output | S | 206 207 **Deliverable:** Noticeably better cover letters and LinkedIn sections. Zero schema changes. 208 209 **Risk:** Content curation (1.2–1.5) is the bottleneck. Set aside 3–4 hours for writing quality examples. 210 211 --- 212 213 ### Sprint 2 — A/B Testing Infrastructure 214 215 **Goal:** Variant assignment, tracking, and first live experiment (golden examples vs. control). 

---

### Feature 4: Feedback Collection → Monthly Retraining

**Critical framing:** Claude fine-tuning is not publicly available. "Retraining" here means **prompt curation** — high-rated real-user outputs get promoted to golden examples, which get injected into future prompts. This is actually better than fine-tuning for this use case because:
- It is interpretable (you can read what is being promoted)
- It is controllable (human approval gate)
- It is fast (no training cycle)
- It compounds: each month's golden examples are better than last month's

**The full data pipeline already partially exists:**
- `analyses` table stores the full JSONB output ✓
- `feedback` table has `analysis_id` FK ✓
- `feedback` stores `selected_issues` ✓

**What's missing:**
1. The export + curation script (its core query is sketched at the end of this feature)
2. The human review step
3. The promotion script that updates golden example files

**Monthly loop:**
```
Week 1 of month:
1. Run: node scripts/export-golden-candidates.cjs
   → Queries: analyses JOIN feedback WHERE rating=true AND selected_issues='[]'
   → Groups by section type (cover_letter, linkedin_about, etc.)
   → Exports top 20 candidates per section to review/YYYY-MM/candidates.md

2. Human reviews candidates.md (15–30 min)
   → Annotates: APPROVE / REJECT / NEEDS_EDIT

3. Run: node scripts/promote-to-golden.cjs --month=YYYY-MM
   → Reads approved candidates
   → Updates lib/golden-examples/*.ts
   → Creates PR for review

4. Merge PR → triggers quality test run → confirm no regressions
```

**Privacy note:** Full analysis results are already stored. Before promoting user outputs to golden examples, verify the Privacy Policy covers internal quality improvement. Add a clause if not already present.
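
A minimal sketch of the candidate query behind step 1, in the same raw-SQL style as the Feature 2 metrics query. The `result` column name for the stored JSONB output is an assumption; section extraction and the per-section top-20 cut happen in the script, not in SQL:

```sql
-- Golden-example candidates: positively rated, zero reported issues.
SELECT
  a.id         AS analysis_id,
  a.result     AS full_output,  -- full JSONB; sections are extracted in the script
  f.created_at AS rated_at
FROM analyses a
JOIN feedback f ON f.analysis_id = a.id
WHERE f.rating = true
  AND f.selected_issues = '[]'
ORDER BY f.created_at DESC;
```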

---

## Sprint Plan

**Total: 4 sprints, ~1 week each**

---

### Sprint 1 — Golden Examples Foundation

**Goal:** Build the library, inject into prompts, deploy. Immediate quality improvement with no new infrastructure.

| # | Task | Effort |
|---|---|---|
| 1.1 | Create `lib/golden-examples/` structure and TypeScript types | S |
| 1.2 | Write + curate 5 LinkedIn About examples (tech, finance, mid-career, senior, career changer) | M |
| 1.3 | Write + curate 5 LinkedIn Headline examples | S |
| 1.4 | Write + curate 5 CV profile summaries + bullet sets | M |
| 1.5 | Write + curate 5 cover letters (aligned with the 3-paragraph rule, no AI-sounding phrases) | M |
| 1.6 | Build `select.ts` — keyword-based 1–2 example picker | S |
| 1.7 | Inject into `cover-letter.ts` prompt | S |
| 1.8 | Inject into LinkedIn plan generation | S |
| 1.9 | Inject into CV suggestions block | S |
| 1.10 | Manual QA: run 3 test analyses, verify richer output | S |

**Deliverable:** Noticeably better cover letters and LinkedIn sections. Zero schema changes.

**Risk:** Content curation (1.2–1.5) is the bottleneck. Set aside 3–4 hours for writing quality examples.

---

### Sprint 2 — A/B Testing Infrastructure

**Goal:** Variant assignment, tracking, and first live experiment (golden examples vs. control).

| # | Task | Effort |
|---|---|---|
| 2.1 | Supabase migration: `experiments` table | S |
| 2.2 | `lib/ab-testing.ts`: `assignVariant()`, `trackExperiment()` | S |
| 2.3 | Integrate variant assignment into `analyze-stream/route.ts` | S |
| 2.4 | Add `experimentVariant` to analysis metadata + stored `analyses` row | S |
| 2.5 | Configure first experiment: `golden-examples-v1` (control = no examples) | S |
| 2.6 | UI A/B middleware (Next.js edge) for landing page CTA test | M |
| 2.7 | Internal dashboard SQL query for variant comparison | S |
| 2.8 | Document experiment protocol (how to read results, significance thresholds) | S |

**Deliverable:** A/B experiment live. Data flowing. First results in ~2 weeks.

---

### Sprint 3 — Automated Quality Testing

**Goal:** Weekly CI quality gate that catches prompt regressions before they reach users.

| # | Task | Effort |
|---|---|---|
| 3.1 | Write 5 synthetic CV + JD test fixtures (realistic, privacy-safe) | M |
| 3.2 | Build `judge.ts` — Claude-as-judge prompt with 5-dimension rubric | M |
| 3.3 | Build `rubric.ts` — scoring logic, pass/fail thresholds | S |
| 3.4 | Build `run-quality.test.ts` — fixture runner, calls real API | M |
| 3.5 | GitHub Actions `quality.yml` (weekly cron + `workflow_dispatch`) | S |
| 3.6 | Quality report: JSON + GitHub Actions summary table | S |
| 3.7 | Test both A/B variants in quality runner (baseline vs. golden examples) | S |
| 3.8 | Set initial quality baselines from first run | S |

**Deliverable:** Weekly quality report. Any future prompt change has a measurable quality impact.

---

### Sprint 4 — Monthly Retraining Pipeline

**Goal:** Close the loop. Real user feedback feeds back into golden examples.

| # | Task | Effort |
|---|---|---|
| 4.1 | Audit feedback table: ensure `analysis_id` is always populated | S |
| 4.2 | `scripts/export-golden-candidates.cjs` — SQL join + section extraction | M |
| 4.3 | Human review format: `review/YYYY-MM/candidates.md` with APPROVE/REJECT | S |
| 4.4 | `scripts/promote-to-golden.cjs` — reads approvals, updates `lib/golden-examples/*.ts` | M |
| 4.5 | Monthly GitHub Actions cron job | S |
| 4.6 | Run first export on existing data | S |
| 4.7 | Review + promote first real-user examples | S |
| 4.8 | Privacy policy check / update for internal quality use | S |
| 4.9 | Write "Monthly Quality Review" runbook | S |

**Deliverable:** First real-world golden examples promoted. The system self-improves.

---

## Dependency Graph

```
Sprint 1 (Golden Examples)
 │
 ├──→ Sprint 2 (A/B) — uses Sprint 1 examples as the "treatment"
 │      │
 │      └──→ Sprint 4 (Retraining) — needs Sprint 2 analysis_id tracking in experiments
 │
 └──→ Sprint 3 (Quality Tests) — uses Sprint 1 examples as benchmarks
        │
        └──→ Sprint 4 — quality tests validate each monthly promotion
```

Sprint 3 can start in parallel with Sprint 2 once Sprint 1 is done.

---

## Key Decisions to Make Before Starting

1. **Content curation ownership**: Who writes the 5 golden examples per category? This is the most important decision — they should be written by someone who has reviewed dozens of real LinkedIn profiles and CVs, not generated by Claude (irony aside).

2. **A/B traffic threshold**: With low-to-moderate traffic, you need ~100 analyses per variant to see meaningful signal. Set a minimum run duration (2 weeks) before reading results. (A minimal significance check is sketched after this list.)

3. **What counts as "retraining-eligible" feedback**: Currently `rating=true AND selectedIssues=[]` (positive, no issues). Should you also promote `rating=false` outputs with specific issues as negative examples? Recommendation: start positive-only; add negative examples in a later iteration.

4. **Golden examples for non-English users**: The app supports 6 languages. Golden examples in English only will slightly dilute the few-shot benefit for non-English runs. Acceptable for now — address in a later sprint once real translated examples exist.
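
For reading results against the threshold in decision 2, a two-proportion z-test is the minimal tool. The choice of test is an assumption, not part of the plan; the experiment protocol doc (task 2.8) should fix the actual convention:

```typescript
// Two-proportion z-test over positive feedback rates per variant.
// |z| >= 1.96 roughly corresponds to significance at the 5% level.
interface VariantStats {
  analyses: number;  // analyses that received feedback
  positives: number; // feedback rows with rating = true
}

export function zScore(control: VariantStats, treatment: VariantStats): number {
  const p1 = control.positives / control.analyses;
  const p2 = treatment.positives / treatment.analyses;
  // Pooled rate under the null hypothesis that both variants perform equally.
  const pooled =
    (control.positives + treatment.positives) /
    (control.analyses + treatment.analyses);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.analyses + 1 / treatment.analyses),
  );
  return (p2 - p1) / se;
}

// At the ~100-per-variant threshold, only large effects clear the bar:
// zScore({ analyses: 100, positives: 62 }, { analyses: 100, positives: 74 })
// ≈ 1.82, so even a 12-point lift is not yet significant at the 5% level.
// This is why the minimum run duration matters.
```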