  1  ---
  2  title: 'Architecture'
  3  category: 'architecture'
  4  last_verified: '2026-03-13'
  5  related_files:
  6    - 'src/stages/'
  7    - 'scripts/claude-orchestrator.sh'
  8    - 'scripts/claude-batch.js'
  9    - 'scripts/claude-store-wrapper.js'
 10    - 'scripts/monitoring-checks.sh'
 11    - 'src/utils/programmatic-scorer.js'
 12    - 'src/utils/circuit-breaker.js'
 13    - 'src/api/free-score-api.js'
 14    - 'website/scan.php'
 15  tags: ['architecture', 'database', 'api', 'ai', 'llm', 'email', 'sms', 'scoring', 'orchestrator']
 16  status: 'current'
 17  replaces: []
 18  ---
 19  
 20  # 333 Method Automation: Architecture
 21  
 22  > **Note**: Sections below marked _(POC/MVP design)_ document historical design decisions. The **Current Architecture** section reflects the production system as of 2026-03-13.
 23  
 24  ## Current Architecture (2026-03-13)
 25  
 26  Fully automated SERP→outreach pipeline. Node.js + Playwright + SQLite. All LLM work routes through `claude -p` (Claude Max subscription, $0 incremental cost) via the orchestrator.
 27  
 28  ### Pipeline Stages
 29  
 30  9 independent stages (`src/stages/`):
 31  
 32  ```mermaid
 33  flowchart LR
 34      KW["Keywords"] --> SERP["SERPs"]
 35      SERP -->|found| ASSET["Assets"]
 36      ASSET -->|assets_captured| SCORE["Scoring"]
 37      SCORE -->|prog_scored| RESCORE["Rescoring"]
 38      RESCORE -->|semantic_scored\nvision_scored| ENRICH["Enrich"]
 39      ENRICH -->|enriched| PROP["Proposals"]
 40      PROP -->|proposals_drafted| OUT["Outreach"]
 41      OUT -->|outreach_sent\noutreach_partial| DONE((" "))
 42  
 43      SCORE -->|"≥82"| HS["high_score"]
 44      RESCORE -->|"≥82"| HS
 45  
 46      IN["Inbound"] --> REPLY["Replies"]
 47  
 48      style HS fill:#4a4,stroke:#333,color:#fff
 49      style DONE fill:#333,stroke:#333
 50  ```
 51  
 52  | Stage     | Entry Status                        | Exit Status                          | Key Flag                |
 53  | --------- | ----------------------------------- | ------------------------------------ | ----------------------- |
 54  | SERPs     | _(keyword)_                         | `found`                              | —                       |
 55  | Assets    | `found`                             | `assets_captured`                    | —                       |
 56  | Scoring   | `assets_captured`                   | `prog_scored`                        | `ENABLE_LLM_SCORING`    |
 57  | Rescoring | `prog_scored`                       | `semantic_scored` / `vision_scored`  | `ENABLE_VISION`         |
 58  | Enrich    | `semantic_scored` / `vision_scored` | `enriched`                           | `ENABLE_ENRICHMENT_LLM` |
 59  | Proposals | `enriched`                          | `proposals_drafted`                  | —                       |
 60  | Outreach  | `proposals_drafted`                 | `outreach_sent` / `outreach_partial` | `SKIP_STAGES`           |
 61  | Replies   | _(inbound)_                         | _(processed)_                        | —                       |
 62  
 63  **Terminal statuses**: `high_score` (≥82, doesn't need help), `ignored` (blocklisted/excluded), `failing` (permanent error).
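
The stage table above can be read as a lookup from a site's current status to the stage that picks it up next. The following is a minimal sketch of that reading; `STAGE_BY_ENTRY_STATUS`, `TERMINAL_STATUSES` and `nextStage()` are hypothetical names used for illustration, and the real stages in `src/stages/` may select sites differently.

```javascript
// Hypothetical lookup built from the stage table above.
const STAGE_BY_ENTRY_STATUS = {
  found: 'Assets',
  assets_captured: 'Scoring',
  prog_scored: 'Rescoring',
  semantic_scored: 'Enrich',
  vision_scored: 'Enrich',
  enriched: 'Proposals',
  proposals_drafted: 'Outreach',
};

// Terminal statuses are never picked up by another stage.
const TERMINAL_STATUSES = new Set(['high_score', 'ignored', 'failing']);

// Returns the stage that should process a site next, or null when the site
// is terminal or its status has no downstream stage in the table.
function nextStage(status) {
  if (TERMINAL_STATUSES.has(status)) return null;
  return STAGE_BY_ENTRY_STATUS[status] ?? null;
}

console.log(nextStage('prog_scored')); // Rescoring
console.log(nextStage('high_score')); // null
```

Keeping the mapping as data rather than as branching code makes it easy to compare against the table when a status is added or renamed.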
 64  
 65  ### Pipeline Loops
 66  
 67  The system runs **4 parallel loops**. Three run inside `src/pipeline-service.js`; the fourth is `scripts/claude-orchestrator.sh`.
 68  
 69  **Terminology:**
 70  
 71  - **Stage** — a named unit of code work in a pipeline-service loop. Processes N sites per cycle using local code or external APIs (Playwright, ZenRows, Twilio, Resend). Controlled by `SKIP_STAGES`.
 72  - **Batch** — a named LLM job sent to the `claude_loop` orchestrator. Each batch is one `claude -p` invocation → JSON response for a specific task type. Batches run asynchronously via Claude Max (zero incremental cost).
 73  
 74  | Loop            | Runs in                                   | Unit    | Stages / Batches                                                                                                                                           | DB tracking column      |
 75  | --------------- | ----------------------------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------- |
 76  | `browser_loop`  | `pipeline-service.js` `runBrowserLoop()`  | Stages  | Assets, Enrich                                                                                                                                             | `last_browser_loop_at`  |
 77  | `api_loop`      | `pipeline-service.js` `runApiLoop()`      | Stage   | SERPs                                                                                                                                                      | `last_api_loop_at`      |
 78  | `outreach_loop` | `pipeline-service.js` `runOutreachLoop()` | Stages  | Outreach, Replies                                                                                                                                          | `last_outreach_loop_at` |
 79  | `claude_loop`   | `claude-orchestrator.sh --loop`           | Batches | `score_sites`, `score_semantic`, `enrich_sites`, `proposals_*`, `reword_*`, `proofread`, `classify_replies`, `reply_responses`, `extract_names`, `oversee`, `classify_errors` | _(systemd timer)_       |
 80  
 81  `browser_loop` and `api_loop` are separated so Playwright work never blocks SERP scraping. `outreach_loop` is separated so 3-day cooldown idle time never starves reply processing.
 82  
 83  ### Scoring Architecture (3-Stage Model)
 84  
 85  ```mermaid
 86  flowchart TD
 87      AC["assets_captured"] --> PROG["Programmatic Scoring\nsrc/utils/programmatic-scorer.js\n(rule-based DOM/regex, 7 factors)"]
 88      PROG -->|prog_scored| SEM["Semantic Scoring\norchestrator score_semantic (Haiku)\n(headline, value prop, USP)"]
 89      SEM -->|semantic_scored| NEXT["→ Enrich"]
 90  
 91      PROG -->|"ENABLE_LLM_SCORING=true"| LLM["LLM Scoring\norchestrator score_sites (Sonnet)\n(full CRO analysis)"]
 92      LLM -->|prog_scored| SEM
 93  
 94      SEM -->|"ENABLE_VISION=true"| VIS["Vision Rescoring\nOpenRouter GPT-4o-mini\n(below-fold screenshot + contacts)"]
 95      VIS -->|vision_scored| NEXT
 96  
 97      SEM -->|"score ≥ 82"| HS["high_score (skipped)"]
 98      VIS -->|"score ≥ 82"| HS
 99  
100      style HS fill:#4a4,stroke:#333,color:#fff
101  ```
102  
103  **Default path** (`ENABLE_LLM_SCORING=false`, `ENABLE_VISION=false`):
104  
105  1. **Programmatic** (`programmatic-scorer.js`) — rule-based DOM/regex for 7 structural factors → `prog_scored`
106  2. **Semantic** (orchestrator `score_semantic`, Haiku) — headline_quality, value_proposition, USP → `semantic_scored`
107  3. Zero OpenRouter cost; Haiku via Claude Max
108  
109  **Optional paths**:
110  
111  - `ENABLE_LLM_SCORING=true` — Sonnet full CRO scoring via orchestrator `score_sites` batch (replaces OpenRouter GPT-4o-mini)
112  - `ENABLE_VISION=true` — GPT-4o-mini below-fold vision rescoring → `vision_scored` (OpenRouter, ~$0.003/site)
113  
114  **Threshold**: Sites scoring ≥ 82 (B- and above) → `high_score` (skipped). Sites < 82 proceed to enrichment/outreach.
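
As a rough illustration of the flags and threshold described above, the sketch below lists the scoring steps for a flag combination and applies the 82 cutoff. `statusAfterScoring()` and `scoringSteps()` are hypothetical helpers, not functions from the codebase.

```javascript
const HIGH_SCORE_THRESHOLD = 82; // cutoff stated above

// Hypothetical helper: picks the status a site receives once a scoring step
// has produced a 0-100 score. `scoredStatus` is the status that step normally
// emits ('semantic_scored' or 'vision_scored').
function statusAfterScoring(score, scoredStatus) {
  return score >= HIGH_SCORE_THRESHOLD ? 'high_score' : scoredStatus;
}

// Hypothetical helper: lists the scoring steps a site passes through for a
// given flag combination, following the diagram above.
function scoringSteps({ enableLlmScoring = false, enableVision = false } = {}) {
  const steps = ['programmatic'];
  if (enableLlmScoring) steps.push('llm');
  steps.push('semantic');
  if (enableVision) steps.push('vision');
  return steps;
}

console.log(statusAfterScoring(82, 'semantic_scored')); // high_score
console.log(statusAfterScoring(81, 'semantic_scored')); // semantic_scored
console.log(scoringSteps({ enableVision: true }).join(' > ')); // programmatic > semantic > vision
```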
115  
116  ### LLM Work (Claude Max Orchestrator)
117  
118  All LLM calls run through `scripts/claude-orchestrator.sh` via `claude -p`. Output passes through `claude-store-wrapper.js` (JSON repair) → `claude-store.js` (DB persistence).
119  
120  ```mermaid
121  flowchart TD
122      subgraph orch["claude-orchestrator.sh --loop"]
123          direction TB
124          CAN["Canary Check\n(node, claude-batch, claude CLI)"]
125  
126          subgraph opus["Opus Batches"]
127              PE["proposals_email (5)"]
128              PS["proposals_sms (10)"]
129              RE["reword_email (10)"]
130              RS["reword_sms (30)"]
131              RF["reword_form (20)"]
132              RL["reword_linkedin (15)"]
133              RX["reword_x (20)"]
134              RR["reply_responses (10)"]
135              PR["proofread (50)"]
136          end
137  
138          subgraph sonnet["Sonnet Batches"]
139              SS["score_sites (10)"]
140              OV["oversee (1)\n30min gate"]
141          end
142  
143          subgraph haiku["Haiku Batches"]
144              SC["score_semantic (20)"]
145              ES["enrich_sites (5)"]
146              CR["classify_replies (50)"]
147              EN["extract_names (50)"]
148              CE["classify_errors (1)\n4h gate"]
149          end
150      end
151  
152      orch --> WRAP["claude-store-wrapper.js\n(JSON repair)"]
153      WRAP --> STORE["claude-store.js\n(DB persist)"]
154  ```
155  
156  **16 batch types** (run via `--loop` or individually via `--type <name>`):
157  
158  | Batch              | Model  | Size | Purpose                                     |
159  | ------------------ | ------ | ---- | ------------------------------------------- |
160  | `proposals_email`  | Opus   | 5    | Generate email proposals                    |
161  | `proposals_sms`    | Opus   | 10   | Generate SMS proposals                      |
162  | `reword_email`     | Opus   | 10   | Improve email messaging                     |
163  | `reword_sms`       | Opus   | 30   | Improve SMS (<160 chars)                    |
164  | `reword_form`      | Opus   | 20   | Improve form submissions                    |
165  | `reword_linkedin`  | Opus   | 15   | Improve LinkedIn DMs                        |
166  | `reword_x`         | Opus   | 20   | Improve X/Twitter DMs                       |
167  | `score_semantic`   | Haiku  | 20   | Headline/value prop/USP scoring             |
168  | `score_sites`      | Sonnet | 10   | Full CRO scoring on HTML                    |
169  | `enrich_sites`     | Haiku  | 5    | Contact extraction from HTML                |
170  | `proofread`        | Opus   | 50   | QA approve/rework/reject messages           |
171  | `classify_replies` | Haiku  | 50   | Inbound intent + sentiment                  |
172  | `extract_names`    | Haiku  | 50   | First name from email addresses             |
173  | `reply_responses`  | Opus   | 10   | Sales funnel replies (time-critical)        |
174  | `oversee`          | Sonnet | 1    | Pipeline health check (30min gate)          |
175  | `classify_errors`  | Haiku  | 1    | Regex patterns for unknown errors (4h gate) |
176  
177  **Conservation mode**: At ≥80% of 5h window or ≥90% weekly Claude Max usage:
178  
179  - **Always runs** (time-critical/cheap): `reply_responses`, `classify_replies`, `extract_names`, `oversee`, `classify_errors`
180  - **Deferred**: `proposals_*`, `reword_*`, `score_*`, `enrich_sites`, `proofread`
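
The conservation rule above amounts to a small gate in front of each batch. The orchestrator itself is a shell script, so the JavaScript below is only a sketch of the decision; the `usage` shape (fractions between 0 and 1) and both function names are assumptions.

```javascript
// Batch types that still run in conservation mode (from the list above).
const ALWAYS_RUN = new Set([
  'reply_responses',
  'classify_replies',
  'extract_names',
  'oversee',
  'classify_errors',
]);

// Assumed usage shape: fractions (0-1) of the 5h window and the weekly quota.
function inConservationMode({ fiveHourUsage, weeklyUsage }) {
  return fiveHourUsage >= 0.8 || weeklyUsage >= 0.9;
}

// Outside conservation mode every batch runs; inside it only the
// time-critical or cheap batch types do.
function shouldRunBatch(batchType, usage) {
  if (!inConservationMode(usage)) return true;
  return ALWAYS_RUN.has(batchType);
}

console.log(shouldRunBatch('proofread', { fiveHourUsage: 0.85, weeklyUsage: 0.4 })); // false
console.log(shouldRunBatch('oversee', { fiveHourUsage: 0.85, weeklyUsage: 0.4 })); // true
```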
181  
182  **Canary check** (once per cycle): Verifies `node`, `claude-batch.js`, and `claude` CLI are executable before processing any batches. Catches glibc/PATH/symlink issues early.
183  
184  **JSON repair pipeline** (`claude-store-wrapper.js`): Unwraps Claude envelope → strips markdown fences → escapes control chars → iterative quote repair (up to 20 passes) → NDJSON reconstruction → validation. Failures dump to `logs/orch-fail-{batch_type}.json`.
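
To make the first stages of the repair pipeline concrete, here is a simplified sketch covering only envelope unwrapping, fence stripping and control-character escaping. It is not the code in `claude-store-wrapper.js`: the envelope shape (`{ "result": "..." }`) is an assumption, all function names are hypothetical, and the quote-repair and NDJSON steps are omitted.

```javascript
// The fence marker is built at runtime so this example contains no literal fence.
const FENCE = '`'.repeat(3);
const OPEN_FENCE = new RegExp('^' + FENCE + '[a-zA-Z]*\\s*');
const CLOSE_FENCE = new RegExp('\\s*' + FENCE + '\\s*$');

// Step 1: unwrap an envelope of the assumed form { "result": "<model text>" }.
function unwrapEnvelope(raw) {
  try {
    const parsed = JSON.parse(raw);
    if (parsed && typeof parsed.result === 'string') return parsed.result;
  } catch {
    // Not an envelope; treat the input as the payload itself.
  }
  return raw;
}

// Step 2: strip a leading and trailing markdown fence, if present.
function stripFences(text) {
  return text.trim().replace(OPEN_FENCE, '').replace(CLOSE_FENCE, '');
}

// Step 3: escape raw control characters that appear inside string literals.
function escapeControlChars(text) {
  let out = '';
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (inString && !escaped && ch.charCodeAt(0) < 0x20) {
      out += JSON.stringify(ch).slice(1, -1); // a raw newline becomes \n
      continue;
    }
    if (inString && !escaped && ch === '"') inString = false;
    else if (!inString && ch === '"') inString = true;
    escaped = inString && !escaped && ch === '\\';
    out += ch;
  }
  return out;
}

// Runs the three steps, then parses; throws if the payload is still invalid.
function repairAndParse(raw) {
  return JSON.parse(escapeControlChars(stripFences(unwrapEnvelope(raw))));
}

const raw = JSON.stringify({
  result: FENCE + 'json\n{"note": "line one\nline two"}\n' + FENCE,
});
console.log(repairAndParse(raw).note.split('\n').length); // 2
```

Each step is a pure string-to-string function, so a failing payload can be replayed through the steps one at a time when debugging.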
185  
186  ### Monitoring (3 Tiers)
187  
188  ```mermaid
189  flowchart TD
190      subgraph T1["Tier 1 — Cron (systemd timer)"]
191          PG["Process Guardian\n(1min)"]
192          PM["Pipeline Monitor\n(5min)"]
193          SH["System Health\n(30min)"]
194      end
195  
196      subgraph T2["Tier 2 — Agents (disabled)"]
197          AG["AGENT_SYSTEM_ENABLED=false"]
198      end
199  
200      subgraph T3["Tier 3 — Claude Code AFK"]
201          MC["monitoring-checks.sh\n(30min)"]
202      end
203  
204      subgraph ORC["Orchestrator Overseer"]
205          OV2["oversee batch (Sonnet)\n30min gate"]
206      end
207  
208      OV2 -->|actions| RST["RESTART_PIPELINE"]
209      OV2 -->|actions| CLR["CLEAR_STALE_TASKS"]
210      OV2 -->|actions| RES["RESET_STUCK_SITES"]
211  
212      style T2 fill:#555,stroke:#333,color:#999
213  ```
214  
215  - **Tier 1 — Cron** (`src/cron/`, systemd timer): Process Guardian (1min), Pipeline Monitor (5min), System Health (30min)
216  - **Tier 2 — Agents** (currently `AGENT_SYSTEM_ENABLED=false`)
217  - **Tier 3 — Claude Code AFK**: `bash scripts/monitoring-checks.sh` every 30min; finds blind spots, fixes code, commits
218  - **Orchestrator overseer**: Runs inside orchestrator loop (30min gate) — autonomous RESTART_PIPELINE / CLEAR_STALE_TASKS / RESET_STUCK_SITES. Replaced the former standalone `sonnet-overseer.js` cron job.
219  
220  ### Circuit Breakers & Auto-Skip
221  
222  API calls in pipeline stages are wrapped with circuit breakers (`src/utils/circuit-breaker.js`):
223  
224  ```mermaid
225  flowchart LR
226      API["API Call"] --> CB{"Circuit\nBreaker"}
227      CB -->|CLOSED| OK["Normal"]
228      CB -->|OPEN| SKIP["Stage Skipped"]
229      CB -->|HALF_OPEN| TEST["Recovery Test"]
230  
231      ERR["5xx / Timeout\n429 / 402"] -->|threshold| CB
232      RL["Rate Limit\nDetected"] --> RLS["rate-limit-scheduler\nauto-adds to SKIP_STAGES"]
233      RLS --> SKIP
234  
235      OK -->|success| CLOSE["CLOSED"]
236      TEST -->|success| CLOSE
237      TEST -->|failure| OPEN["OPEN"]
238  ```
239  
240  | Breaker    | API                  | Timeout | Limit Type | Stages                                |
241  | ---------- | -------------------- | ------- | ---------- | ------------------------------------- |
242  | OpenRouter | AI scoring/proposals | 120s    | hourly     | scoring, rescoring, proposals         |
243  | ZenRows    | SERP scraping        | 300s    | daily      | serps                                 |
244  | Twilio     | SMS sending          | 30s     | hourly     | outreach                              |
245  | Resend     | Email sending        | 30s     | hourly     | outreach                              |
246  | ZeroBounce | Email validation     | 90s     | hourly     | _(fail-open, doesn't block outreach)_ |
247  
248  **Auto-skip**: When a breaker opens (429/402), `rate-limit-scheduler.js` dynamically adds affected stages to `SKIP_STAGES`. When the rate limit window resets, stages are automatically re-enabled. No manual `.env` editing needed.
249  
250  - State persists in `logs/rate-limits.json`
251  - Check: `npm run rate-limits`
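
The three breaker states in the diagram can be sketched as a small class. This is a generic illustration of the pattern, not the implementation in `src/utils/circuit-breaker.js`; the failure threshold, the cooldown and the injectable `now` clock are all assumptions.

```javascript
// Minimal three-state breaker with illustrative defaults.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 60_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // True when a call may proceed; moves OPEN to HALF_OPEN after the cooldown.
  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state !== 'OPEN';
  }

  // Any success closes the breaker and clears the failure count.
  recordSuccess() {
    this.state = 'CLOSED';
    this.failures = 0;
  }

  // A failed recovery test, or too many failures, opens the breaker.
  recordFailure() {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}

let clock = 0;
const breaker = new CircuitBreaker({ failureThreshold: 2, resetAfterMs: 1000, now: () => clock });
breaker.recordFailure();
breaker.recordFailure();
console.log(breaker.state, breaker.canRequest()); // OPEN false
clock = 1000;
console.log(breaker.canRequest(), breaker.state); // true HALF_OPEN
breaker.recordSuccess();
console.log(breaker.state); // CLOSED
```

A stage would call `canRequest()` before each API call and skip its work while the answer is `false`.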
252  
253  ### Inbound Sales Funnel (Free Website Scanner)
254  
255  Complements the outbound pipeline with a self-service inbound channel. Prospects enter their own URL, get a free score, and self-select into paid products.
256  
257  **Funnel flow:**
258  
259  ```mermaid
260  flowchart TD
261      AD["Ad\n(Google / Facebook / LinkedIn)"] --> SCAN["BRAND_URL/scan\n(enter URL)"]
262      SCAN -->|"POST /api/score"| API["Node.js Scoring API\n(programmatic scorer, $0/scan)"]
263      API --> SCORE["Instant score + grade\n+ traffic lights"]
264      SCORE --> EMAIL["Email capture\n(gates factor breakdown)"]
265      EMAIL --> PEEK["Free peek\n(worst factor, detailed)"]
266      PEEK --> QF["$47 Quick Fixes Report\n(tripwire)"]
267      QF --> FULL["$297 Full CRO Audit\n(core product)"]
268  ```
269  
270  **Architecture:**
271  
272  ```mermaid
273  flowchart TD
274      subgraph PHP["PHP Frontend (Hostinger)"]
275          SCANPHP["scan.php — URL input + animated score reveal"]
276          APIPHP["api.php — proxies to Node.js API, caches results"]
277          MAINJS["main.js — scanner animations, email capture, PayPal"]
278      end
279  
280      PHP -->|"POST /api/score"| NODE
281  
282      subgraph NODE["Node.js Scoring API (NixOS, systemd)"]
283          FREEAPI["src/api/free-score-api.js"]
284          FETCH["Fetches HTML via HTTP (not Playwright)"]
285          SCORER["scoreWebsiteProgrammatically() — $0"]
286          RATE["Rate limits: 10/IP/hour, CAPTCHA after 3"]
287          CACHE["Cache: same URL → 24h"]
288      end
289  
290      NODE --> DB
291  
292      subgraph DB["SQLite (free_scans table)"]
293          COLS["scan_id, url, domain, email, score, grade, score_json\nutm_source/medium/campaign, converted_to, expires_at\nIndustry benchmarks from 23,990+ scored sites"]
294      end
295  ```
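
The rate limits and 24h cache named in the diagram could be sketched as follows. This is an in-memory illustration only: the function names and `Map`-based storage are assumptions, and the real API in `src/api/free-score-api.js` may persist this state differently.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const MAX_SCANS_PER_HOUR = 10; // from the diagram above
const CAPTCHA_AFTER = 3; // from the diagram above
const CACHE_TTL_MS = 24 * HOUR_MS;

const scansByIp = new Map(); // ip -> array of scan timestamps (ms)
const scoreCache = new Map(); // url -> { result, storedAt }

// Decides how to treat a scan request from `ip` at time `now`:
// 'ok', 'captcha' (more than 3 scans this hour) or 'blocked' (10 already used).
function checkRateLimit(ip, now = Date.now()) {
  const recent = (scansByIp.get(ip) || []).filter((t) => now - t < HOUR_MS);
  if (recent.length >= MAX_SCANS_PER_HOUR) {
    scansByIp.set(ip, recent);
    return 'blocked';
  }
  recent.push(now);
  scansByIp.set(ip, recent);
  return recent.length > CAPTCHA_AFTER ? 'captcha' : 'ok';
}

// Returns the cached result for a URL if it is younger than 24h, else null.
function getCached(url, now = Date.now()) {
  const hit = scoreCache.get(url);
  return hit && now - hit.storedAt < CACHE_TTL_MS ? hit.result : null;
}

function setCached(url, result, now = Date.now()) {
  scoreCache.set(url, { result, storedAt: now });
}

const outcomes = [];
for (let i = 0; i < 11; i += 1) outcomes.push(checkRateLimit('203.0.113.7', 1000 + i));
console.log(outcomes[0], outcomes[3], outcomes[10]); // ok captcha blocked
```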
296  
297  **Product ladder:**
298  
299  | Tier              | Product                    | Price | Reveals                                                        |
300  | ----------------- | -------------------------- | ----- | -------------------------------------------------------------- |
301  | Free (pre-email)  | Score + grade + percentile | $0    | Overall score only                                             |
302  | Free (post-email) | Traffic lights + free peek | $0    | 10 factor indicators (red/amber/green) + weakest factor detail |
303  | Quick Fixes       | 3-5 page PDF               | $47   | All 10 factor scores + top 3 fixes with exact copy             |
304  | Full Audit        | Comprehensive report       | $297  | Vision analysis + screenshots + full action plan + competitors |
305  
306  **Key files:**
307  
308  - `src/api/free-score-api.js` — Scoring API endpoint (Express)
309  - `src/utils/programmatic-scorer.js` — Core scoring engine (reused, zero cost)
310  - `website/scan.php` — Scanner landing page (in the website repo)
311  - `website/assets/js/scanner.js` — Frontend scan flow (in the website repo)
312  - `db/migrations/095-free-scans.sql` — Schema migration
313  
314  **Email drip (post-scan, no purchase):** 5 emails over 14 days via Resend. Day 0: recap, Day 2: free tip, Day 5: social proof, Day 7: results expiring, Day 14: auto re-scan.
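
The drip timing above can be expressed as data plus one small function. The day offsets and themes come from the schedule; `DRIP_STEPS` and `dripSchedule()` are hypothetical names, and how the real system queues these sends is not covered here.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Offsets and themes taken from the schedule described above.
const DRIP_STEPS = [
  { day: 0, theme: 'recap' },
  { day: 2, theme: 'free tip' },
  { day: 5, theme: 'social proof' },
  { day: 7, theme: 'results expiring' },
  { day: 14, theme: 'auto re-scan' },
];

// Hypothetical helper: returns the send time for each drip email,
// counted in whole days from the moment of the scan.
function dripSchedule(scannedAt) {
  return DRIP_STEPS.map(({ day, theme }) => ({
    theme,
    sendAt: new Date(scannedAt.getTime() + day * DAY_MS),
  }));
}

const schedule = dripSchedule(new Date('2026-03-13T00:00:00Z'));
console.log(schedule[4].sendAt.toISOString()); // 2026-03-27T00:00:00.000Z
```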
315  
316  ---
317  
318  ## Original Design (POC/MVP — historical reference)
319  
320  Target: a 90% automated SERP→outreach pipeline for low-scoring local business sites. Local Node.js + Playwright stack with Cline-assisted development.
321  
322  ## Functional Requirements
323  
324  **SERP Scraping**:
325  
326  - Input: Keyword (e.g. "plumber Seattle"), directory (Yelp/Angi first).
327  - Extract: Top 10 URLs + SERP contacts.
328  - Tool: ZenRows API ($49 per 10K results; free tier of 1K results).
329  
330  **Site Processing** (parallel tabs, max 15):
331  
332  - Visit URL → wait full load (network idle).
333  - Capture: Domain, landing URL, keyword, above-fold screenshot (Playwright native), below-fold (scroll+clip), mobile above-fold, full rendered DOM.
334  - **Screenshot Optimization**: Apply intelligent cropping (DOM-based nav/footer removal + Sharp attention-based resizing) to reduce LLM token usage by 20-35%.
335  - **Contact Check**: Parse DOM for ≥2 primaries (email regex, tel:, form action or link to contact page). If <2: Inject operator panel + use Playwright's `page.pause()` for human intervention.
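
The contact check in the last bullet can be sketched with a few regular expressions over the rendered HTML. The patterns and function names below are illustrative assumptions; a production check would more likely query the DOM through Playwright than match raw markup.

```javascript
// Illustrative patterns only; real-world HTML is better handled via the DOM.
const EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i;
const TEL_RE = /href=["']tel:[^"']+["']/i;
const FORM_RE = /<form\b[^>]*\baction=/i;
const CONTACT_LINK_RE = /<a\b[^>]*href=["'][^"']*contact[^"']*["']/i;

// Returns which primary contact methods were found in the rendered HTML.
function findPrimaryContacts(html) {
  const found = [];
  if (EMAIL_RE.test(html)) found.push('email');
  if (TEL_RE.test(html)) found.push('phone');
  if (FORM_RE.test(html) || CONTACT_LINK_RE.test(html)) found.push('form');
  return found;
}

// Fewer than two primaries means the operator has to step in.
function needsOperator(html) {
  return findPrimaryContacts(html).length < 2;
}

const html = '<a href="tel:+12065550100">Call</a> <a href="/contact-us">Contact</a>';
console.log(findPrimaryContacts(html)); // [ 'phone', 'form' ]
console.log(needsOperator('<p>No contact details here</p>')); // true
```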
336  
337  **LLM Scoring** (GPT-4o-mini via OpenRouter SDK):
338  
339  - Prompt: captured assets + "Score A+ to F on \[factors: offer/CTA/USP/etc.\] and extract contacts if the score is low".
340  - Post-SERP: the top-scoring site in each SERP becomes the competitor reference for that SERP's low scorers.
341  
342  **Outreach Generation** (only for scores B- to E):
343  
344  - LLM: 3x proposal variants + 3x subject variants (prompt includes competitor data, scores, ≤5 sales samples for keyword, ≤5 samples for country, best practices docs).
345  - **Channel Selection Logic**: Prioritize by engagement (SMS > Contact Form > Email > Social). Match variant style to channel (Variant 1=short/urgent for SMS, Variant 2=professional for forms, Variant 3=detailed for email).
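
The channel selection rule above reduces to a priority list plus a channel-to-variant map. The sketch below assumes hypothetical channel keys (`sms`, `form`, `email`, `social`) and a hypothetical `selectChannel()` helper. The source assigns no variant to social channels, so the sketch returns `null` there rather than guessing.

```javascript
// Priority order and variant styles from the channel selection logic above.
const CHANNEL_PRIORITY = ['sms', 'form', 'email', 'social'];
const VARIANT_BY_CHANNEL = { sms: 1, form: 2, email: 3 };

// Hypothetical helper: `available` lists the channels a site can be reached on.
// Returns the highest-priority channel and its matching variant, or null when
// no known channel is available.
function selectChannel(available) {
  const channel = CHANNEL_PRIORITY.find((c) => available.includes(c));
  if (!channel) return null;
  return { channel, variant: VARIANT_BY_CHANNEL[channel] ?? null };
}

console.log(selectChannel(['email', 'form'])); // { channel: 'form', variant: 2 }
```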
346  
347  **Dispatch & Store**:
348  
349  - Send/track each variant via appropriate channel.
350  - SQLite: `outreaches` table with delivery tracking, click tracking, and sale flag.
351  - **Operator Tools**: DBeaver for data edits (POC/MVP), Streamlit dashboard (Phase 3).
352  
353  **Edge Handling**:
354  
355  - **Bot Evasion**: puppeteer-extra-plugin-stealth for anti-detection.
356  - **Parallelization**: Max 15 concurrent pages; monitor memory/tab limits.
357  - **Error Recovery**: Retry logic with exponential backoff for API failures; timeout handling (30s max per site).
358  - **Cline Integration**: Capture console errors via `page.on('console')` and `page.on('pageerror')` for automated debugging.
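
The retry logic with exponential backoff mentioned in the list above might look like the sketch below. The base delay, cap and retry count are illustrative values, and `backoffDelay()` and `withRetry()` are hypothetical helpers rather than functions from the codebase.

```javascript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, and
// so on, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30_000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries an async function, waiting longer after each failure. The `wait`
// option exists so callers (and tests) can replace the real timer.
async function withRetry(fn, { retries = 3, baseMs = 500, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await wait(backoffDelay(attempt, baseMs));
    }
  }
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // [ 500, 1000, 2000, 4000 ]
```

An API call would be wrapped as `withRetry(() => callApi())`, where `callApi` stands for whatever request the stage makes.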
359  
360  ## Overall Architecture
361  
362  ```mermaid
363  flowchart TD
364      subgraph APP["Node.js Application (NixOS)"]
365          SERP["SERP Scraping\nZenRows API → URL list"]
366          subgraph BROWSER["Browser Automation — Playwright"]
367              PAGE["Page Processing\nScreenshots + DOM capture"]
368              SHARP["Screenshot Optimization\nSharp (resize/crop/compress)"]
369          end
370          subgraph SCORING["Scoring"]
371              PROG2["Programmatic Scorer\n(DOM/regex, $0)"]
372              CLAUDE["Claude Max Orchestrator\n(Haiku/Sonnet/Opus, $0)"]
373              OR["OpenRouter\n(optional vision, ~$0.003/site)"]
374          end
375          subgraph OUTREACH2["Outreach"]
376              SMS["Twilio SMS"]
377              EMAILCH["Resend Email\n(direct fetch API)"]
378              FORM["Form Automation\n(Playwright)"]
379              SOCIAL["X / LinkedIn"]
380          end
381          STORE2["SQLite (better-sqlite3)"]
382      end
383  
384      SERP --> BROWSER --> SCORING --> OUTREACH2 --> STORE2
385  ```
386  
387  ## Tech Stack
388  
389  - **Runtime**: Node.js v18+ (native ESM support)
390  - **Browser Automation**:
391    - `playwright` (headed chromium, native screenshots)
392    - `playwright-extra` + `puppeteer-extra-plugin-stealth` (anti-detection)
393  - **Image Processing**: `sharp` (resize, crop, compress for LLM optimization)
394  - **Database**:
395    - `better-sqlite3` (faster than sql.js, DBeaver-compatible)
396    - POC/MVP: DBeaver for read/write
397    - Phase 3: Custom Streamlit dashboard
398  - **APIs**:
399    - ZenRows (SERP scraping)
400    - OpenRouter (GPT-4o-mini for scoring/proposals)
401    - Twilio (SMS + webhooks)
402    - Resend (SMTP for email channel - Phase 3)
403  - **Development**: VSCodium with Cline extension for automated debugging via console.log capture
404  
405  ## Execution Flow
406  
407  1. `node src/poc.js "keyword"` → launches headed Playwright browser
408  2. ZenRows fetches SERP → extracts top 10 URLs
409  3. Process sites in batches (15 concurrent pages max):
410     - Navigate to URL, wait for network idle
411     - Capture 3 screenshots (desktop above/below, mobile above)
412     - Optimize images (DOM-based cropping + Sharp resize)
413     - Extract HTML DOM
414  4. Send to OpenRouter for scoring
415  5. Store results in SQLite
416  6. (MVP+) Generate proposals for low scorers → dispatch via channels
417  7. Console errors → logged for Cline to auto-fix
418  
419  **Entry Point**: `src/main.js`:
420  
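```javascript
const { chromium } = require('playwright-extra');
const stealth = require('puppeteer-extra-plugin-stealth')();

// Register the stealth plugin before launching so every page gets the evasions.
chromium.use(stealth);

// scrapeSERP() and processSites() are defined elsewhere in the project (not shown).
async function main(keyword) {
  // Headed browser with a small delay between actions.
  const browser = await chromium.launch({
    headless: false,
    slowMo: 100,
  });

  try {
    const context = await browser.newContext();

    // Process pipeline: fetch SERP URLs, then capture and score each site.
    const sites = await scrapeSERP(keyword);
    await processSites(context, sites);
  } finally {
    // Close the browser even if a stage throws.
    await browser.close();
  }
}

// The keyword is the first command-line argument.
main(process.argv[2]).catch((err) => {
  console.error(err);
  process.exit(1);
});
```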
441  
442  ## Integration Landscape Diagram
443  
444  ```mermaid
445  flowchart LR
446      subgraph POC["Phase 1: POC — $1.30 / 1K sites"]
447          ZR["ZenRows\nSERP API"] --> PW["Playwright\nBrowser (×15)\nScreenshots + DOM\n+ Sharp optim"]
448          PW --> ORS["OpenRouter\nGPT-4o-mini\nScoring + Vision"]
449          ORS --> DB1["SQLite"]
450      end
451  
452      subgraph MVP["Phase 2: MVP — $61 / 10K sites"]
453          PROP["OpenRouter\nProposals\n(3 variants)"] --> CH["Contact Channels\n1. Twilio SMS\n2. Contact Form"]
454          CH --> DB2["SQLite:\noutreaches\n+ conversations"]
455      end
456  
457      subgraph FULL["Phase 3: Full System — $71/mo / 10K sites"]
458          direction TB
459          subgraph OUT["Outbound Channels"]
460              R2["Resend Email"]
461              SOC["Social Browser\n(LinkedIn/X)"]
462          end
463          subgraph IN["Inbound Handling"]
464              INSMS["Inbound SMS\n(Twilio webhook)"]
465              INEML["Inbound Email\n(Resend webhook)"]
466          end
467          INSMS --> CONV["SQLite:\nconversations\n+ reply templates"]
468          INEML --> CONV
469      end
470  
471      POC --> MVP --> FULL
472  ```
473  
474  ## Data Flow
475  
476  ```mermaid
477  flowchart LR
478      KW2["Keyword"] --> SERP2["SERP"] --> SITE["Site"] --> ASSET2["Asset"] --> SCORE2["Score"] --> PROPOSAL["Proposal"] --> OUTREACH3["Outreach"] --> CONVO["Conversation"]
479  ```
480  
481  ---
482  
483  ## SQLite Schema
484  
485  **Access via**: DBeaver (POC/MVP), Streamlit dashboard (Phase 3)
486  
487  **Core Tables:**
488  
489  - **sites** - Main site data with screenshots (cropped + uncropped), HTML DOM, conversion scores, processing status
490  - **keywords** - Keyword tracking with ZenRows count, processed count, low-scoring metrics, last scraped timestamp
491  - **outreaches** - Proposal variants (1-3) by channel with delivery tracking, click tracking, sale outcomes
492  - **conversations** - Threaded inbound/outbound messages linked to outreaches, with sentiment analysis
493  - **config** - Global settings (sender details, low score cutoff=82, templates, API keys)
494  
495  **Key Features:**
496  
497  - Screenshot storage: Both cropped (AI-optimized) and uncropped (full content) versions
498  - Conversion scores: JSON blob with detailed scoring + conversion_score (0-100) for filtering
499  - Processing pipeline: Status tracking (found → assets_captured → prog_scored → semantic_scored → enriched → proposals_drafted → outreach_sent)
500  - Multi-channel support: SMS, Email, Contact Form, LinkedIn, Facebook, Instagram
501  - Analytics ready: Indexes on key fields for performance (keyword, status, channel, timestamps)
502  
503  See full schema: [db/schema.sql](db/schema.sql)
504  
505  ---
506  
507  ## LLM Calls in Pipeline (Current)
508  
509  All LLM calls run through `claude -p` (Claude Max, $0 incremental) via the orchestrator, except vision rescoring which uses OpenRouter when `ENABLE_VISION=true`.
510  
511  | #   | Stage                  | Model                    | Trigger                               | Output                                          |
512  | --- | ---------------------- | ------------------------ | ------------------------------------- | ----------------------------------------------- |
513  | 1   | Programmatic scoring   | _(none — rule-based)_    | Always (DOM/regex)                    | Score + grade + 7 factor_scores                 |
514  | 2   | Semantic scoring       | Haiku (`claude -p`)      | `score_semantic` orchestrator batch   | headline_quality, value_proposition, USP scores |
515  | 3   | CRO scoring (optional) | Sonnet (`claude -p`)     | `score_sites` orchestrator batch      | Full CRO factor analysis                        |
516  | 4   | Vision rescoring (opt) | GPT-4o-mini (OpenRouter) | `ENABLE_VISION=true`                  | Below-fold re-score + contacts                  |
517  | 5   | Enrichment             | Haiku (`claude -p`)      | `enrich_sites` orchestrator batch     | Contact extraction from HTML                    |
518  | 6   | Proposals              | Opus (`claude -p`)       | `proposals_*` orchestrator batch      | N personalized outreach messages                |
519  | 7   | Rewording              | Opus (`claude -p`)       | `reword_*` orchestrator batch         | Improved messaging (trust/proof framework)      |
520  | 8   | Proofreading           | Opus (`claude -p`)       | `proofread` orchestrator batch        | QA approve/rework/reject before send            |
521  | 9   | Reply classification   | Haiku (`claude -p`)      | `classify_replies` orchestrator batch | intent + sentiment                              |
522  | 10  | Name extraction        | Haiku (`claude -p`)      | `extract_names` orchestrator batch    | first name from email address                   |
523  | 11  | Reply responses        | Opus (`claude -p`)       | `reply_responses` orchestrator batch  | sales funnel reply message                      |
524  | 12  | Oversight              | Sonnet (`claude -p`)     | `oversee` (30min gate)                | corrective actions (restart, reset, clear)      |
525  | 13  | Error classification   | Haiku (`claude -p`)      | `classify_errors` (4h gate)           | regex patterns for unknown errors               |
526  
527  **Cost breakdown** (current defaults, `ENABLE_VISION=false`):
528  
529  - Per-site pipeline cost: ~$0 (programmatic scoring + Haiku via Claude Max)
530  - Proposals + rewording + proofreading + oversight: $0 incremental (Claude Max subscription)
531  - OpenRouter: $0 (only charged if vision rescoring re-enabled)
532  
533  ### Contact Extraction (Current)
534  
535  1. **HTML regex** (`src/utils/html-contact-extractor.js`): emails, phones, forms — always runs
536  2. **Enrichment** (orchestrator `enrich_sites`, Haiku): contact extraction from About/Contact/Legal pages via stealth browser
537  3. **Vision rescoring** (`ENABLE_VISION=true`, optional): contacts from below-fold screenshots (OpenRouter)
538  
539  ### Email Sending
540  
541  The Resend SDK was replaced with a direct `fetch()` call to `https://api.resend.com/emails`, guarded by `AbortSignal.timeout()` with a 20s limit. This fixes stale TCP keep-alive hangs on NixOS (the SDK had no abort support). Circuit breaker and rate limiting still apply.
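
A direct call of the kind described here could look like the sketch below. The endpoint, bearer header and message fields follow Resend's public API; `buildResendRequest()` and `sendEmail()` are hypothetical names, and how the project loads its API key is not shown.

```javascript
// Builds the request for Resend's POST /emails endpoint. `message` is expected
// to carry fields such as from, to, subject and html.
function buildResendRequest(apiKey, message, timeoutMs = 20_000) {
  return {
    url: 'https://api.resend.com/emails',
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(message),
      // Aborts the request if no response arrives in time, so a stale
      // keep-alive socket cannot hang the caller.
      signal: AbortSignal.timeout(timeoutMs),
    },
  };
}

// Sends one email and returns the parsed response; throws on a timeout
// or on any non-2xx status.
async function sendEmail(apiKey, message) {
  const { url, options } = buildResendRequest(apiKey, message);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Resend responded with HTTP ${res.status}`);
  return res.json();
}
```

Splitting request construction from the network call keeps the timeout and headers testable without sending anything.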
542  
543  ### Prompt Files
544  
545  | Prompt                                     | Used by                                    |
546  | ------------------------------------------ | ------------------------------------------ |
547  | `prompts/PROPOSAL.md`                      | Orchestrator proposals + rewording batches |
548  | `prompts/REPLIES.md`                       | Orchestrator reply responses               |
549  | `docs/05-outreach/email-best-practices.md` | Email proposal context                     |
550  | `docs/05-outreach/sms-best-practices.md`   | SMS proposal context                       |
551  | `prompts/SCORING.md`                       | Legacy LLM scoring (OpenRouter)            |
552  | `prompts/RESCORING.md`                     | Legacy rescoring (OpenRouter)              |
553  
554  - `VISION.md`: image-to-text extraction instructions