pipeline-status-breakdown.md
1 # Pipeline Status Breakdown System 2 3 > ✅ IMPLEMENTED (2026-02-28, commit 6e2c42d8) 4 > 5 > **Deliverables:** `src/utils/error-categories.js`, `src/cli/status.js` (`npm run status`), `src/utils/outreach-guard.js`, `src/cron/classify-unknown-errors.js`, dashboard widgets, SMS business hours fix, assets 120s timeout, 7,662 sites reset, 1,908 outreaches reset. 39 new tests. 17 files changed. 6 7 ## Context 8 9 The pipeline has 335K ignored and 107K failing sites, 3.3K failed outreaches — but no visibility into _why_. Error messages exist in the DB but aren't categorized or surfaced. The user asked for a tree-view breakdown showing totals, 24h/1h deltas, and sub-statuses grouped by reason — both as a CLI command and a dashboard widget. 10 11 During investigation, we also found: 12 13 - **1,908 SMS outreaches permanently failed with no error** — caused by a bug where business-hours blocks throw, hit a catch that sets `status='failed'` without `error_message` 14 - **Assets stage hangs freeze the entire browser loop** — no per-site timeout on Playwright captures, which froze the user's host Chromium too 15 - **79 sites stuck at `failing` from EACCES** — unmounted SSD `/run/media/jason/store`; user wants them reprocessed 16 - **7,583 sites stuck at `failing` from browser launch conflict** — Playwright `--user-data-dir` error; reset for reprocessing 17 - **No outreach error dedup protection** — need a "reputation guard" that halts sending after 25 of the same error in 2h, for both Twilio and Resend 18 19 --- 20 21 ## Phase 1: Error Categorization Module 22 23 **New file: `src/utils/error-categories.js`** 24 25 Regex-based pattern matching — no LLM needed. Two groups per context: 26 27 ### Site Error Patterns 28 29 **Terminal (ignore-type, won't be retried):** 30 | Pattern | Label | 31 |---|---| 32 | `/Social media platform/i` | Social media | 33 | `/Business directory/i` | Business directory | 34 | `/Not a local business/i` | Not a local business | 35 | `/Home service franchise/i` | Franchise | 36 | `/Government/i` | Government domain | 37 | `/Education domain/i` | Education domain | 38 | `/Non-commercial/i` | Non-commercial | 39 | `/Cross-border duplicate/i` | Cross-border duplicate | 40 | `/Duplicate domain/i` | Duplicate domain | 41 | `/Country mismatch/i` | Country mismatch | 42 | `/No email template/i` | No template | 43 | `/HTTP 40[34].*Cannot capture/i` | Permanent HTTP error | 44 | `/Max recapture attempts/i` | Max recapture retries | 45 46 **Retriable (could succeed if retried/fixed):** 47 | Pattern | Label | 48 |---|---| 49 | `/userDataDir\|launchPersistentContext/i` | Browser launch conflict | 50 | `/status code 400/i` | API bad request | 51 | `/status code 401/i` | API auth error | 52 | `/status code 429/i` | Rate limited | 53 | `/Incomplete LLM response/i` | Incomplete LLM | 54 | `/Timed out\|Timeout/i` | Timeout | 55 | `/ECONNRESET\|ETIMEDOUT\|ENOTFOUND/i` | Network error | 56 | `/navigating and changing/i` | Page navigating | 57 | `/browser has been closed/i` | Browser crashed | 58 | `/EACCES/i` | Permission denied | 59 | `/screenshot_path is NULL/i` | Missing screenshot | 60 | `/database is locked/i` | DB lock | 61 | `/chromium-nice ENOENT/i` | Chromium not found | 62 | `/HTTP (5\d\d\|null)/i` | Server error | 63 64 ### Outreach Error Patterns 65 66 **Terminal:** 67 | Pattern | Label | 68 |---|---| 69 | `/opted out/i` | Opted out | 70 | `/Fake.*phone/i` | Invalid phone | 71 | `/gdpr_blocked/i` | GDPR blocked | 72 | `/Cloudflare.*blocked/i` | Cloudflare blocked | 73 74 **Retriable:** 75 | Pattern | Label | 76 |---|---| 77 | `/business hours/i` | Business hours block | 78 | `/status.?code 429\|rate/i` | Rate limited | 79 | `/Timeout/i` | Form timeout | 80 | `/net::ERR_NAME_NOT_RESOLVED/i` | DNS failure | 81 | `NULL or empty` | Unknown (no error stored) | 82 83 ### Exports 84 85 ```javascript 86 export function categorizeError(errorMessage, context = 'site') 87 // Returns { group: 'terminal'|'retriable'|'unknown', label: string } 88 89 export function buildStatusTree(db) 90 // Returns structured tree with totals, deltas, children grouped by terminal/retriable 91 92 export function buildOutreachTree(db) 93 // Returns outreach breakdown with same structure 94 ``` 95 96 The `buildStatusTree` SQL for deltas: 97 98 - Top-level: `COUNT(CASE WHEN updated_at > datetime('now','-24 hours') THEN 1 END)` per status 99 - Sub-status deltas: same pattern grouped by `error_message` category — each child row gets its own `delta_24h` and `delta_1h` 100 101 Cross-border duplicates collapsed into one row with total count. 102 103 --- 104 105 ## Phase 1b: Auto-Expanding Error Patterns (Daily Agent Task) 106 107 **Problem:** New error messages will emerge that don't match any regex. They'll pile up under "Unknown" without visibility. 108 109 **Solution:** A daily cron task that: 110 111 1. Queries `sites` and `outreaches` for `error_message` values that fall through to `unknown` in `categorizeError()` 112 2. Groups them by frequency (top uncategorized errors, min 5 occurrences) 113 3. Makes an LLM call (Haiku) asking: "Given these error messages, suggest regex patterns and labels for each group. Format as JSON array: `[{ pattern, label, group, context }]`" 114 4. **Writes proposals to a new DB table** `error_pattern_proposals` — does NOT auto-commit to code 115 5. A human reviews via dashboard and approves/rejects — approved patterns get merged into `error-categories.js` 116 117 **New table: `error_pattern_proposals`** 118 119 ```sql 120 CREATE TABLE error_pattern_proposals ( 121 id INTEGER PRIMARY KEY AUTOINCREMENT, 122 pattern TEXT NOT NULL, 123 label TEXT NOT NULL, 124 group_name TEXT NOT NULL CHECK(group_name IN ('terminal', 'retriable')), 125 context TEXT NOT NULL CHECK(context IN ('site', 'outreach')), 126 example_errors TEXT, -- JSON array of sample error messages that matched 127 occurrence_count INTEGER, -- how many errors this would have caught 128 status TEXT DEFAULT 'pending' CHECK(status IN ('pending', 'approved', 'rejected')), 129 created_at DATETIME DEFAULT CURRENT_TIMESTAMP, 130 reviewed_at DATETIME, 131 reviewed_by TEXT 132 ); 133 ``` 134 135 **New file: `src/cron/classify-unknown-errors.js`** 136 137 - Runs daily (via `cron_jobs` table entry `classifyUnknownErrors`) 138 - Uses `claude-haiku-4-5-20251001` via OpenRouter 139 - Prompt: structured JSON output only, no prose 140 - On approval: the agent task system queues a Developer task to update `error-categories.js` with the new pattern 141 142 **Dashboard: add review panel to `dashboard/pages/1_🔧_Pipeline.py`** 143 144 - Show pending proposals with sample errors, occurrence count, suggested pattern 145 - Approve/reject buttons (write back via `database.py`) 146 147 **New migration:** `db/migrations/067-error-pattern-proposals.sql` 148 149 --- 150 151 ## Phase 2: CLI Command — `npm run status` 152 153 **New file: `src/cli/status.js`** 154 155 Output format (ANSI colored, box-drawing chars). **Statuses in pipeline order** (found → assets_captured → scored → rescored → enriched → proposals_drafted → outreach_sent → high_score, then ignore and failing at the end as non-pipeline buckets): 156 157 ``` 158 Pipeline Status Total Δ24h Δ1h 159 ────────────────────────────────────────────────────────────── 160 ▸ found 26,646 +2,206 +0 161 ▸ assets_captured 294 +257 +0 162 ▸ scored 0 +0 +0 163 ▸ rescored 25,294 +352 +0 164 ▸ enriched 0 +0 +0 165 ▸ proposals_drafted 0 +0 +0 166 ▸ outreach_sent 2,273 +0 +0 167 ▸ high_score 3 +0 +0 168 ▸ ignore 334,956 +5,148 +0 169 ├─ Cross-border duplicate 250,799 +0 +0 170 ├─ Social media 56,252 +0 +0 171 ├─ Business directory 45,919 +0 +0 172 ├─ Duplicate domain 9,369 +0 +0 173 ├─ Not a local business 6,435 +0 +0 174 ├─ Franchise 6,229 +0 +0 175 ├─ Non-commercial 4,054 +0 +0 176 ├─ Government/Education 3,899 +0 +0 177 └─ Other 799 +0 +0 178 ▸ failing 106,843 +0 +0 179 ├─ Retriable 180 │ ├─ Browser launch conflict 7,583 +0 +0 181 │ ├─ API bad request 2,904 +0 +0 182 │ ├─ Missing screenshot 354 +0 +0 183 │ └─ ... (more) 184 └─ Terminal 185 ├─ Permanent HTTP error 68 +0 +0 186 └─ ... (more) 187 188 Outreach Breakdown Total Δ24h Δ1h 189 ────────────────────────────────────────────────────────────── 190 ▸ pending 45 +0 +0 191 ▸ approved (in cooldown) 2,865 +0 +0 192 ├─ form 2,139 +0 +0 193 ├─ linkedin 241 +0 +0 194 ├─ x 217 +0 +0 195 ├─ email 136 +0 +0 196 └─ sms 132 +0 +0 197 ▸ sent 215 +0 +0 198 ├─ sms 112 +0 +0 199 └─ email 103 +0 +0 200 ▸ delivered 0 +0 +0 201 ▸ opened 0 +0 +0 202 ▸ clicked 0 +0 +0 203 ▸ replied 0 +0 +0 204 ▸ failed 3,260 +0 +0 205 ├─ Retriable 206 │ ├─ Business hours block 1,807 +0 +0 207 │ ├─ Rate limited (Resend 429) 1,282 +0 +0 208 │ └─ Unknown (no error stored) 101 +0 +0 209 └─ Terminal 210 ├─ Invalid phone 25 +0 +0 211 ├─ Cloudflare blocked 1 +0 +0 212 └─ Form timeout (terminal) 34 +0 +0 213 ▸ bounced 12 +0 +0 214 ▸ gdpr_blocked 347 +0 +0 215 ``` 216 217 **Status ordering:** 218 219 - Sites: pipeline sequence first (`found → assets_captured → scored → rescored → enriched → proposals_drafted → outreach_sent → high_score`), then non-pipeline (`ignore`, `failing`) at bottom 220 - Outreaches: delivery lifecycle order (`pending → approved → sent → delivered → opened → clicked → replied → failed → bounced → gdpr_blocked`) 221 222 **Color rules:** 223 224 - Green delta: increases on good statuses (`enriched`, `outreach_sent`, `sent`, `delivered`, `opened`, `replied`) 225 - Red delta: increases on bad statuses (`failing`, `failed`, `bounced`) 226 - Grey: `ignore` and duplicates (expected, not actionable) 227 - Cyan: neutral mid-pipeline statuses (`found`, `rescored`, `assets_captured`, `approved`) 228 229 Uses `src/utils/summary-generator.js` color patterns. Add to `package.json`: `"status": "node src/cli/status.js"` 230 231 --- 232 233 ## Phase 3: Dashboard Widget 234 235 **Modify: `dashboard/pages/1_🔧_Pipeline.py`** 236 237 Add a "Status Breakdown" section using `st.expander()` per status: 238 239 ```python 240 st.subheader("Status Breakdown") 241 for row in status_tree: 242 label = f"{row['status']} — {row['total']:,}" 243 delta_parts = [] 244 if row['delta_24h']: delta_parts.append(f"+{row['delta_24h']:,} 24h") 245 if row['delta_1h']: delta_parts.append(f"+{row['delta_1h']:,} 1h") 246 if delta_parts: label += f" ({', '.join(delta_parts)})" 247 248 with st.expander(label): 249 if 'retriable' in row['children']: 250 st.markdown("**🔄 Retriable**") 251 st.dataframe(retriable_df, hide_index=True) 252 st.markdown("**⛔ Terminal**") 253 st.dataframe(terminal_df, hide_index=True) 254 else: 255 st.dataframe(children_df, hide_index=True) 256 ``` 257 258 Add an "Outreach Breakdown" section similarly below, with channel sub-breakdown and failed reason tree. 259 260 **Modify: `dashboard/utils/database.py`** — add `get_status_tree()` and `get_outreach_tree()` functions that read from `dashboard_cache` key `status_tree` / `outreach_tree`. 261 262 --- 263 264 ## Phase 4: Precalc Integration 265 266 **Modify: `src/cron/precompute-dashboard.js`** 267 268 Add `cacheStatusTree(db)` function that: 269 270 1. Calls `buildStatusTree(db)` and `buildOutreachTree(db)` from `error-categories.js` 271 2. Stores results in `dashboard_cache` as keys `status_tree` and `outreach_tree` 272 3. Called during the existing `precomputeAll()` function alongside other cache functions 273 274 The CLI (`npm run status`) runs queries directly (one-shot, no caching needed). 275 276 --- 277 278 ## Phase 5: Bug Fixes 279 280 ### Fix 1: SMS Business Hours Bug (`src/outreach/sms.js`) 281 282 **Problem:** Line 123 throws `new Error('SMS blocked: outside business hours')`, caught at line 187 which sets `status='failed'` without `error_message`. 1,908 outreaches permanently lost. 283 284 **Fix:** For business hours blocks, return early instead of throwing: 285 286 ```javascript 287 // Line 121-123: Replace throw with return 288 // Business hours: don't throw, return so status stays 'approved' for next attempt 289 return { success: false, outreachId, reason: 'business_hours', skipped: true }; 290 ``` 291 292 Also fix the catch block at line 187-191 to persist `error_message`: 293 294 ```javascript 295 db.prepare(`UPDATE outreaches SET status = 'failed', error_message = ? WHERE id = ?`).run( 296 error.message, 297 outreachId 298 ); 299 ``` 300 301 **Data fix:** Reset the 1,908 broken outreaches back to `approved`: 302 303 ```sql 304 UPDATE outreaches SET status = 'approved', error_message = NULL 305 WHERE contact_method='sms' AND status='failed' AND (error_message IS NULL OR error_message=''); 306 ``` 307 308 ### Fix 2: Assets Timeout (`src/stages/assets.js`) 309 310 **Problem:** No per-site timeout. Hung Chromium freezes the entire browser loop for hours. 311 312 **Fix:** Wrap `captureSiteScreenshots` with `withTimeout()` (already exported from `src/utils/error-handler.js`): 313 314 ```javascript 315 return withTimeout( 316 captureSiteScreenshots(site.id, site.url, db, site.country_code), 317 120000, 318 `Capture timed out after 120s for ${site.url}` 319 ); 320 ``` 321 322 Apply at line 128 (HTML-only path) and line 299 (vision path) inside the `processBatch` callback. 323 324 ### Fix 3: EACCES + Browser Conflict Sites Reset 325 326 Reset both categories (79 + 7,583 = 7,662 sites): 327 328 ```sql 329 -- 79 EACCES sites (unmounted SSD) 330 UPDATE sites SET status = 'found', retry_count = 0, error_message = NULL 331 WHERE status = 'failing' AND error_message LIKE '%EACCES%'; 332 333 -- 7,583 browser launch conflict sites (Playwright --user-data-dir error) 334 UPDATE sites SET status = 'found', retry_count = 0, error_message = NULL 335 WHERE status = 'failing' AND error_message LIKE '%userDataDir%'; 336 ``` 337 338 ### Fix 4: Outreach Reputation Guard (`src/utils/outreach-guard.js`) 339 340 **Problem:** No protection against burning Twilio/Resend reputation by sending thousands of messages that all fail with the same error. The 1,908 SMS and 1,282 email failures happened in rapid succession. 341 342 **Solution:** A lightweight error dedup tracker (not opossum — this sits _above_ the circuit breaker): 343 344 ```javascript 345 // In-memory sliding window: tracks error messages per channel in last 2h 346 const errorCounts = new Map(); // key: "sms:error message" → [{timestamp}, ...] 347 const WINDOW_MS = 2 * 60 * 60 * 1000; // 2 hours 348 const THRESHOLD = 25; // same error 25 times → halt channel 349 350 export function recordOutreachError(channel, errorMessage) { ... } 351 export function shouldHaltChannel(channel) { ... } 352 export function getHaltedChannels() { ... } 353 export function clearHalt(channel) { ... } 354 ``` 355 356 **Integration into `src/outreach/sms.js` and `src/outreach/email.js`:** 357 358 - Before sending: `if (shouldHaltChannel('sms')) { return { skipped: true, reason: 'channel_halted' }; }` 359 - On failure: `recordOutreachError('sms', error.message)` 360 - The existing opossum circuit breakers handle rate limits (429/503). This guard handles _application-level_ errors like bad phone formats, auth failures, or unexpected API responses that repeat. 361 362 **Integration into `src/stages/outreach.js`:** 363 364 - At start of each channel batch: check `shouldHaltChannel(method)` and skip with warning log 365 - Log when halted: `"⛔ SMS channel halted: 25× 'outside business hours' errors in 2h window"` 366 367 **Persistence:** In-memory only (resets on pipeline restart). This is intentional — after a restart, the underlying issue is presumably fixed or the window has passed. 368 369 --- 370 371 ## Files Summary 372 373 | File | Action | Purpose | 374 | ----------------------------------------------- | ---------- | ------------------------------------------------------------------------- | 375 | `src/utils/error-categories.js` | **Create** | Regex patterns, categorizeError(), buildStatusTree(), buildOutreachTree() | 376 | `src/cli/status.js` | **Create** | CLI entry point for `npm run status` | 377 | `package.json` | **Modify** | Add `"status"` script | 378 | `src/cron/precompute-dashboard.js` | **Modify** | Add cacheStatusTree() to precalc | 379 | `src/cron/classify-unknown-errors.js` | **Create** | Daily Haiku LLM task to propose regex patterns for unknown errors | 380 | `db/migrations/067-error-pattern-proposals.sql` | **Create** | New table for LLM-proposed error patterns awaiting review | 381 | `db/schema.sql` | **Modify** | Add error_pattern_proposals table | 382 | `dashboard/pages/1_🔧_Pipeline.py` | **Modify** | Add expandable status breakdown + error pattern review panel | 383 | `dashboard/utils/database.py` | **Modify** | Add get_status_tree(), get_outreach_tree(), get_error_proposals() | 384 | `src/outreach/sms.js` | **Modify** | Fix business hours throw + error_message persistence | 385 | `src/stages/assets.js` | **Modify** | Add withTimeout() on captureSiteScreenshots | 386 | `src/utils/outreach-guard.js` | **Create** | Error dedup tracker — halt channel after 25 same errors in 2h | 387 | `src/stages/outreach.js` | **Modify** | Check shouldHaltChannel() before sending | 388 | `tests/utils/error-categories.test.js` | **Create** | Unit tests for categorization | 389 | `tests/outreach/sms-business-hours.test.js` | **Create** | Test SMS business hours fix | 390 | `tests/utils/outreach-guard.test.js` | **Create** | Test error dedup halting logic | 391 | `tests/cron/classify-unknown-errors.test.js` | **Create** | Test unknown error classification cron | 392 393 --- 394 395 ## Verification 396 397 1. `npm run status` — verify tree output in pipeline order, sub-status deltas visible, colors render 398 2. `npm run dashboard` — check Pipeline page has expandable status breakdown with sub-deltas 399 3. `npm test` — all tests pass 400 4. `node -e "import('./src/utils/error-categories.js').then(m => console.log(m.categorizeError('EACCES: permission denied')))"` — verify `{ group: 'retriable', label: 'Permission denied' }` 401 5. Verify 7,662 sites (79 EACCES + 7,583 browser conflict) reset to `found` 402 6. Verify 1,908 SMS outreaches reset to `approved` 403 7. Restart pipeline, confirm Assets captures timeout after 120s (not hang forever) 404 8. Test outreach guard: simulate 25 same-error SMS failures, verify channel halts 405 9. Verify `npm run status` shows outreach guard status (halted channels, if any) 406 10. Run `node src/cron/classify-unknown-errors.js` manually, verify proposals appear in `error_pattern_proposals` table 407 11. Approve a proposal in dashboard, verify it queues a Developer agent task to update `error-categories.js`