# Final Product Improvements — Sprint Plan & Architecture

## Infrastructure Reality Check

Before planning, here's what already exists that matters:

| Asset | Status | Implication |
|---|---|---|
| `analyses` table (full JSONB) | ✅ Supabase | Analyses are already stored — feedback loop is feasible |
| `feedback` table (`analysis_id` FK) | ✅ Supabase | Can join feedback to output for retraining |
| `lib/knowledge/` (salary, skills, linkedin playbook, cv-practices) | ✅ Exists | Golden examples slot naturally alongside this |
| A/B testing | ❌ Nothing | Build from scratch |
| Golden examples | ❌ Nothing | Build from scratch |
| Quality CI | ✅ GitHub Actions (lint/type/build only) | Extend, don't rebuild |
| Feedback `selectedIssues` | ✅ Stored | Already collecting structured issue types |

---

## Feature-by-Feature Architecture Analysis

---

### Feature 1: Golden Examples

**What "golden examples" actually do in an LLM pipeline:**
They are few-shot examples injected into system prompts. They work because Claude pattern-matches quality — showing it 2 excellent cover letters makes it write closer to that standard than any instruction can. This is the highest-ROI change you can make to output quality, and the improvement is measurable.

**Scope: 3 types, 5 each:**
- **LinkedIn**: About section, Headline, Summary-of-skills block
- **CV**: Profile summary + 3 quantified experience bullets per example
- **Cover letter**: Full 3-paragraph letter (matches the ≤3 paragraph rule)

**File structure:**
```
lib/golden-examples/
  linkedin-about.ts       # 5 examples with metadata
  linkedin-headline.ts    # 5 examples
  cv-profiles.ts          # 5 CV summaries + bullet sets
  cover-letters.ts        # 5 cover letters
  select.ts               # Role/industry keyword matching (no embeddings needed yet)
```

**The selection logic:** simple keyword matching on `targetRole` + `industry`. Pick 1–2 examples per prompt injection. Embeddings/semantic search is a later optimization — keyword matching covers 80% of the value at 5% of the cost.

**Format of each example:**
```typescript
interface GoldenExample {
  id: string;
  role: string;           // "Product Manager"
  industry: string;       // "tech" | "finance" | "healthcare" | "general"
  seniority: string;      // "senior" | "mid" | "junior" | "any"
  content: string;        // The actual text
  whyItWorks: string;     // 1 sentence — used in judge prompts + system prompt commentary
  keywords: string[];     // For matching
}
```
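
A minimal sketch of what `select.ts` could look like, assuming the `GoldenExample` interface above is exported from a shared module; the scoring weights and the default of 2 picks are illustrative, not decided:

```typescript
// lib/golden-examples/select.ts (sketch, not the final implementation)
import type { GoldenExample } from './types'; // assumed location of the interface above

export function selectGoldenExamples(
  pool: GoldenExample[],
  targetRole: string,
  industry: string,
  topN = 2,
): GoldenExample[] {
  const roleWords = targetRole.toLowerCase().split(/\s+/);

  const scored = pool.map((example) => {
    let score = 0;
    // Exact industry match wins; "general" examples are a weak fallback.
    if (example.industry === industry) score += 3;
    else if (example.industry === 'general') score += 1;
    // Count keyword overlap with the target role.
    for (const keyword of example.keywords) {
      if (roleWords.some((word) => keyword.toLowerCase().includes(word))) score += 1;
    }
    return { example, score };
  });

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((entry) => entry.example);
}
```

Deterministic, no API calls, trivially unit-testable; swapping in embeddings later only changes the scoring step.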

**Integration points:**
1. `lib/prompts/cover-letter.ts` — inject 1 matching example with `[WHY IT WORKS: ...]` annotation
2. LinkedIn plan generation — inject 1 About + 1 Headline example
3. `lib/prompts/gap-analysis.ts` CV suggestions block — inject 1 CV bullets example
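
The three call sites can share one small formatter so the annotation stays consistent; a sketch, with the wrapper text and helper name being illustrative:

```typescript
// Hypothetical helper for rendering a selected example into a prompt block.
export function renderGoldenExample(example: { content: string; whyItWorks: string }): string {
  return [
    '--- GOLDEN EXAMPLE (match this quality bar; do not copy the content) ---',
    example.content,
    `[WHY IT WORKS: ${example.whyItWorks}]`,
    '--- END GOLDEN EXAMPLE ---',
  ].join('\n');
}
```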

**Content curation note:** This is the hardest and most important part. Bad golden examples actively hurt output quality. Each example needs to pass a human review against a checklist: quantified impact, no buzzwords, industry-specific language, correct length, no AI-sounding phrases.

---

### Feature 2: A/B Testing

**Two levels, built independently:**

**Level A — Prompt A/B (what matters most):**
Test hypothesis: *"Golden example injection improves output quality as measured by positive feedback rate."*

Architecture:
```sql
-- experiments table (Supabase migration sketch)
CREATE TABLE experiments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_id TEXT NOT NULL,
  user_id_hash TEXT NOT NULL,                 -- SHA-256(user_id + experiment_id), never raw user_id
  variant TEXT NOT NULL,                      -- 'control' | 'treatment'
  analysis_id UUID REFERENCES analyses (id),  -- assumes the analyses PK column is id
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Variant assignment: deterministic hash — `sha256(userId + experimentId) % 2`. Same user always gets the same variant. Stable across sessions. No cookies needed server-side.
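
A sketch of the Sprint 2 `assignVariant()` helper using Node's built-in crypto module; the exact signature and file location are placeholders until task 2.2:

```typescript
// lib/ab-testing.ts (sketch)
import { createHash } from 'node:crypto';

export type Variant = 'control' | 'treatment';

export function assignVariant(userId: string, experimentId: string): Variant {
  const digest = createHash('sha256').update(userId + experimentId).digest();
  // Parity of the first digest byte gives a stable ~50/50 split:
  // the same user always lands in the same bucket for a given experiment.
  return digest[0] % 2 === 0 ? 'control' : 'treatment';
}

// The same hash doubles as the privacy-safe identifier stored in experiments.user_id_hash.
export function hashUserId(userId: string, experimentId: string): string {
  return createHash('sha256').update(userId + experimentId).digest('hex');
}
```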

Tracking: add `metadata.experimentVariant` to the SSE `complete` event and to the stored `analyses` row.

Metrics query (raw Supabase SQL):
```sql
SELECT
  e.variant,
  COUNT(*) as analyses,
  AVG(f.rating::int) as positive_rate
FROM experiments e
JOIN feedback f ON e.analysis_id = f.analysis_id
GROUP BY e.variant;
```
**Level B — UI A/B (secondary, lower priority):**
Next.js middleware sets a cookie on first visit, and Vercel Analytics tracks page events per variant. Useful for landing page headline tests and CTA copy. Low engineering effort once Level A is built.
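
A minimal sketch of the Level B middleware using the standard Next.js middleware API; the cookie name, 50/50 split, and matcher are placeholder choices:

```typescript
// middleware.ts (sketch): assign a UI variant cookie on first visit
import { NextResponse, type NextRequest } from 'next/server';

const COOKIE_NAME = 'ui-ab-variant'; // hypothetical cookie name

export function middleware(request: NextRequest) {
  const response = NextResponse.next();
  if (!request.cookies.get(COOKIE_NAME)) {
    const variant = Math.random() < 0.5 ? 'control' : 'treatment';
    // Keep the assignment stable for ~90 days so returning visitors see the same UI.
    response.cookies.set(COOKIE_NAME, variant, { maxAge: 60 * 60 * 24 * 90, path: '/' });
  }
  return response;
}

export const config = { matcher: '/' }; // landing page only
```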

**What to test first:** The single most valuable first experiment is **golden examples injection vs. control** (Feature 1 vs. baseline). This ties Features 1 and 2 together in the same sprint.

---

### Feature 3: Automated Testing Against Golden Examples

**What this is NOT:** it is not testing React components. It is **quality regression testing** — given a synthetic CV + job posting, run the full pipeline and judge whether outputs meet quality standards using Claude-as-judge.

**Why this matters:** Every prompt change is a quality change. Without regression tests, you are flying blind. A prompt that makes cover letters shorter might also make them worse. You need to catch that before deploying.

**Test harness:**
```
__tests__/quality/
  fixtures/
    01-senior-engineer/     # CV + JD + expected thresholds
    02-product-manager/
    03-career-changer/
    04-junior-developer/
    05-non-tech-role/       # covers breadth
  run-quality.test.ts       # Vitest integration, calls real API
  judge.ts                  # Claude-as-judge prompt
  rubric.ts                 # Scoring dimensions + thresholds
```
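
A sketch of the judge call assuming the official `@anthropic-ai/sdk`; the model name, prompt wording, and response shape are placeholders to be settled in task 3.2:

```typescript
// __tests__/quality/judge.ts (sketch)
import Anthropic from '@anthropic-ai/sdk';

export interface JudgeScores {
  specificity: number;
  accuracy: number;
  actionability: number;
  compactness: number;
  qualityVsGolden: number;
}

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function judgeOutput(
  output: string,
  jobDescription: string,
  goldenExample: string,
): Promise<JudgeScores> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5', // placeholder; pin whichever model the pipeline already uses
    max_tokens: 500,
    messages: [
      {
        role: 'user',
        content:
          'Score OUTPUT from 1-5 on: specificity, accuracy, actionability, compactness, ' +
          'qualityVsGolden. Respond with a JSON object only.\n\n' +
          `JOB DESCRIPTION:\n${jobDescription}\n\nGOLDEN EXAMPLE:\n${goldenExample}\n\nOUTPUT:\n${output}`,
      },
    ],
  });

  const block = response.content[0];
  return JSON.parse(block.type === 'text' ? block.text : '{}') as JudgeScores;
}
```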

**Rubric (5 dimensions, 1–5 scale each):**
1. **Specificity** — Uses language from the actual JD, not generic
2. **Accuracy** — No hallucinated facts, no invented experience
3. **Actionability** — Reader knows exactly what to do next
4. **Compactness** — Right length, no filler
5. **Quality vs golden** — Comparable to the injected golden example

Pass = average ≥ 3.5/5. Fail = any single dimension ≤ 2.
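
The thresholds translate directly into `rubric.ts`; a sketch, with dimension names mirroring the list above:

```typescript
// __tests__/quality/rubric.ts (sketch)
import type { JudgeScores } from './judge';

export const PASS_AVERAGE = 3.5; // pass requires average >= 3.5/5
export const FAIL_FLOOR = 2;     // any single dimension <= 2 fails outright

export function evaluate(scores: JudgeScores): { pass: boolean; average: number } {
  const values = Object.values(scores) as number[];
  const average = values.reduce((sum, value) => sum + value, 0) / values.length;
  const pass = average >= PASS_AVERAGE && values.every((value) => value > FAIL_FLOOR);
  return { pass, average };
}
```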

**Run schedule:** Weekly (Monday 6am via cron) + on `workflow_dispatch` for manual trigger. NOT on every PR — too slow and too expensive. CI on PRs stays lint/typecheck/unit-tests only.

**Cost per run:** 5 fixtures × 3 sections each × 1 judge call = ~15 Claude calls ≈ $0.50–1.50 per week.

**Report format:** JSON to file + GitHub Actions summary table. Optionally pipe to Slack webhook.

---

### Feature 4: Feedback Collection → Monthly Retraining

**Critical framing:** Claude fine-tuning is not publicly available. "Retraining" here means **prompt curation** — high-rated real-user outputs get promoted to golden examples, which get injected into future prompts. This is actually better than fine-tuning for this use case because:
- It is interpretable (you can read what is being promoted)
- It is controllable (human approval gate)
- It is fast (no training cycle)
- It compounds: each month's golden examples are better than last month's

**The full data pipeline already partially exists:**
- `analyses` table stores the full JSONB output ✓
- `feedback` table has `analysis_id` FK ✓
- `feedback` stores `selected_issues` ✓

**What's missing:**
1. The export + curation script
2. The human review step
3. The promotion script that updates golden example files

**Monthly loop:**
```
Week 1 of month:
  1. Run: node scripts/export-golden-candidates.cjs
     → Queries: analyses JOIN feedback WHERE rating=true AND selected_issues='[]'
     → Groups by section type (cover_letter, linkedin_about, etc.)
     → Exports top 20 candidates per section to review/YYYY-MM/candidates.md

  2. Human reviews candidates.md (15–30 min)
     → Annotates: APPROVE / REJECT / NEEDS_EDIT

  3. Run: node scripts/promote-to-golden.cjs --month=YYYY-MM
     → Reads approved candidates
     → Updates lib/golden-examples/*.ts
     → Creates PR for review

  4. Merge PR → triggers quality test run → confirm no regressions
```
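
The core of step 1 is one join against the existing tables. A sketch with supabase-js, using the column names from this document; the jsonb comparison for `selected_issues` and the name of the JSONB output column may need adjusting to the real schema:

```typescript
// scripts/export-golden-candidates (core query sketch)
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!, // server-side script; service role assumed
);

export async function fetchGoldenCandidates(limit = 20) {
  // Positive feedback with no reported issues, joined back to the stored analysis.
  const { data, error } = await supabase
    .from('analyses')
    .select('id, result, created_at, feedback!inner(rating, selected_issues)') // "result" stands in for the JSONB output column
    .eq('feedback.rating', true)
    .eq('feedback.selected_issues', '[]')
    .order('created_at', { ascending: false })
    .limit(limit);

  if (error) throw error;
  return data;
}
```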

**Privacy note:** Full analysis results are already stored. Before promoting user outputs to golden examples, verify the Privacy Policy covers internal quality improvement. Add a clause if not already present.

---

## Sprint Plan

**Total: 4 sprints, ~1 week each**

---

### Sprint 1 — Golden Examples Foundation

**Goal:** Build the library, inject into prompts, deploy. Immediate quality improvement with no new infrastructure.

| # | Task | Effort |
|---|---|---|
| 1.1 | Create `lib/golden-examples/` structure and TypeScript types | S |
| 1.2 | Write + curate 5 LinkedIn About examples (tech, finance, mid-career, senior, career changer) | M |
| 1.3 | Write + curate 5 LinkedIn Headline examples | S |
| 1.4 | Write + curate 5 CV profile summaries + bullet sets | M |
| 1.5 | Write + curate 5 cover letters (aligned with 3-para rule + no AI chars) | M |
| 1.6 | Build `select.ts` — keyword-based 1–2 example picker | S |
| 1.7 | Inject into `cover-letter.ts` prompt | S |
| 1.8 | Inject into LinkedIn plan generation | S |
| 1.9 | Inject into CV suggestions block | S |
| 1.10 | Manual QA: run 3 test analyses, verify richer output | S |

**Deliverable:** Noticeably better cover letters and LinkedIn sections. Zero schema changes.

**Risk:** Content curation (1.2–1.5) is the bottleneck. Set aside 3–4 hours for writing quality examples.

---

### Sprint 2 — A/B Testing Infrastructure

**Goal:** Variant assignment, tracking, and first live experiment (golden examples vs. control).

| # | Task | Effort |
|---|---|---|
| 2.1 | Supabase migration: `experiments` table | S |
| 2.2 | `lib/ab-testing.ts`: `assignVariant()`, `trackExperiment()` | S |
| 2.3 | Integrate variant assignment into `analyze-stream/route.ts` | S |
| 2.4 | Add `experimentVariant` to analysis metadata + `analyses` stored row | S |
| 2.5 | Configure first experiment: `golden-examples-v1` (control = no examples) | S |
| 2.6 | UI A/B middleware (Next.js edge) for landing page CTA test | M |
| 2.7 | Internal dashboard SQL query for variant comparison | S |
| 2.8 | Document experiment protocol (how to read results, significance thresholds) | S |

**Deliverable:** A/B experiment live. Data flowing. First results in ~2 weeks.

---

### Sprint 3 — Automated Quality Testing

**Goal:** Weekly CI quality gate that catches prompt regressions before they reach users.

| # | Task | Effort |
|---|---|---|
| 3.1 | Write 5 synthetic CV + JD test fixtures (realistic, privacy-safe) | M |
| 3.2 | Build `judge.ts` — Claude-as-judge prompt with 5-dimension rubric | M |
| 3.3 | Build `rubric.ts` — scoring logic, pass/fail thresholds | S |
| 3.4 | Build `run-quality.test.ts` — fixture runner, calls real API | M |
| 3.5 | GitHub Actions `quality.yml` (weekly cron + `workflow_dispatch`) | S |
| 3.6 | Quality report: JSON + GitHub Actions summary table | S |
| 3.7 | Test both A/B variants in quality runner (baseline vs. golden examples) | S |
| 3.8 | Set initial quality baselines from first run | S |

**Deliverable:** Weekly quality report. Any future prompt change has a measurable quality impact.

---

### Sprint 4 — Monthly Retraining Pipeline

**Goal:** Close the loop. Real user feedback feeds back into golden examples.

| # | Task | Effort |
|---|---|---|
| 4.1 | Audit feedback table: ensure `analysis_id` always populated | S |
| 4.2 | `scripts/export-golden-candidates.cjs` — SQL join + section extraction | M |
| 4.3 | Human review format: `review/YYYY-MM/candidates.md` with APPROVE/REJECT | S |
| 4.4 | `scripts/promote-to-golden.cjs` — reads approvals, updates `lib/golden-examples/*.ts` | M |
| 4.5 | Monthly GitHub Actions cron job | S |
| 4.6 | Run first export on existing data | S |
| 4.7 | Review + promote first real-user examples | S |
| 4.8 | Privacy policy check / update for internal quality use | S |
| 4.9 | Write "Monthly Quality Review" runbook | S |

**Deliverable:** First real-world golden examples promoted. The system self-improves.

---

## Dependency Graph

```
Sprint 1 (Golden Examples)
      │
      ├──→ Sprint 2 (A/B) — uses Sprint 1 examples as the "treatment"
      │        │
      │        └──→ Sprint 4 (Retraining) — needs Sprint 2 analysis_id tracking in experiments
      │
      └──→ Sprint 3 (Quality Tests) — uses Sprint 1 examples as benchmarks
                └──→ Sprint 4 — quality tests validate each monthly promotion
```

Sprint 3 can start in parallel with Sprint 2 after Sprint 1 is done.

---

## Key Decisions to Make Before Starting

1. **Content curation ownership**: Who writes the 5 golden examples per category? This is the most important decision — they should be written by someone who has reviewed dozens of real LinkedIn profiles and CVs, not generated by Claude (irony aside).

2. **A/B traffic threshold**: With low/moderate traffic, you need ~100 analyses per variant to see meaningful signal. Set a minimum run duration (2 weeks) before reading results; a quick significance check is sketched after this list.

3. **What counts as "retraining-eligible" feedback**: Currently `rating=true AND selectedIssues=[]` (positive, no issues). Should you also promote `rating=false` outputs with specific issues as negative examples? Recommend: start positive-only, add negative examples in a later iteration.

4. **Golden examples for non-English users**: The app supports 6 languages. Golden examples in English only will slightly dilute the few-shot benefit for non-English runs. Acceptable for now — address in a later sprint when real translated examples exist.
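
For decision 2, a rough two-proportion z-test is enough to tell whether a difference in positive-feedback rate between variants is signal or noise (a sketch; 1.96 corresponds to ~95% confidence, two-sided):

```typescript
// Rough significance check for the variant comparison query in Feature 2 (sketch).
// positives = analyses rated positive; total = analyses with any feedback, per variant.
export function isLikelySignificant(
  controlPositives: number,
  controlTotal: number,
  treatmentPositives: number,
  treatmentTotal: number,
): boolean {
  const p1 = controlPositives / controlTotal;
  const p2 = treatmentPositives / treatmentTotal;
  // Pooled rate under the null hypothesis that both variants perform equally.
  const pooled = (controlPositives + treatmentPositives) / (controlTotal + treatmentTotal);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlTotal + 1 / treatmentTotal));
  const z = Math.abs(p1 - p2) / standardError;
  return z >= 1.96;
}
```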