/ docs / plans / implemented / pipeline-status-breakdown.md
pipeline-status-breakdown.md
  1  # Pipeline Status Breakdown System
  2  
  3  > ✅ IMPLEMENTED (2026-02-28, commit 6e2c42d8)
  4  >
  5  > **Deliverables:** `src/utils/error-categories.js`, `src/cli/status.js` (`npm run status`), `src/utils/outreach-guard.js`, `src/cron/classify-unknown-errors.js`, dashboard widgets, SMS business hours fix, assets 120s timeout, 7,662 sites reset, 1,908 outreaches reset. 39 new tests. 17 files changed.
  6  
  7  ## Context
  8  
  9  The pipeline has 335K ignored and 107K failing sites, 3.3K failed outreaches — but no visibility into _why_. Error messages exist in the DB but aren't categorized or surfaced. The user asked for a tree-view breakdown showing totals, 24h/1h deltas, and sub-statuses grouped by reason — both as a CLI command and a dashboard widget.
 10  
 11  During investigation, we also found:
 12  
 13  - **1,908 SMS outreaches permanently failed with no error** — caused by a bug where business-hours blocks throw, hit a catch that sets `status='failed'` without `error_message`
 14  - **Assets stage hangs freeze the entire browser loop** — no per-site timeout on Playwright captures, which froze the user's host Chromium too
 15  - **79 sites stuck at `failing` from EACCES** — unmounted SSD `/run/media/jason/store`; user wants them reprocessed
 16  - **7,583 sites stuck at `failing` from browser launch conflict** — Playwright `--user-data-dir` error; reset for reprocessing
 17  - **No outreach error dedup protection** — need a "reputation guard" that halts sending after 25 of the same error in 2h, for both Twilio and Resend
 18  
 19  ---
 20  
 21  ## Phase 1: Error Categorization Module
 22  
 23  **New file: `src/utils/error-categories.js`**
 24  
 25  Regex-based pattern matching — no LLM needed. Two groups per context:
 26  
 27  ### Site Error Patterns
 28  
 29  **Terminal (ignore-type, won't be retried):**
 30  | Pattern | Label |
 31  |---|---|
 32  | `/Social media platform/i` | Social media |
 33  | `/Business directory/i` | Business directory |
 34  | `/Not a local business/i` | Not a local business |
 35  | `/Home service franchise/i` | Franchise |
 36  | `/Government/i` | Government domain |
 37  | `/Education domain/i` | Education domain |
 38  | `/Non-commercial/i` | Non-commercial |
 39  | `/Cross-border duplicate/i` | Cross-border duplicate |
 40  | `/Duplicate domain/i` | Duplicate domain |
 41  | `/Country mismatch/i` | Country mismatch |
 42  | `/No email template/i` | No template |
 43  | `/HTTP 40[34].*Cannot capture/i` | Permanent HTTP error |
 44  | `/Max recapture attempts/i` | Max recapture retries |
 45  
 46  **Retriable (could succeed if retried/fixed):**
 47  | Pattern | Label |
 48  |---|---|
 49  | `/userDataDir\|launchPersistentContext/i` | Browser launch conflict |
 50  | `/status code 400/i` | API bad request |
 51  | `/status code 401/i` | API auth error |
 52  | `/status code 429/i` | Rate limited |
 53  | `/Incomplete LLM response/i` | Incomplete LLM |
 54  | `/Timed out\|Timeout/i` | Timeout |
 55  | `/ECONNRESET\|ETIMEDOUT\|ENOTFOUND/i` | Network error |
 56  | `/navigating and changing/i` | Page navigating |
 57  | `/browser has been closed/i` | Browser crashed |
 58  | `/EACCES/i` | Permission denied |
 59  | `/screenshot_path is NULL/i` | Missing screenshot |
 60  | `/database is locked/i` | DB lock |
 61  | `/chromium-nice ENOENT/i` | Chromium not found |
 62  | `/HTTP (5\d\d\|null)/i` | Server error |
 63  
 64  ### Outreach Error Patterns
 65  
 66  **Terminal:**
 67  | Pattern | Label |
 68  |---|---|
 69  | `/opted out/i` | Opted out |
 70  | `/Fake.*phone/i` | Invalid phone |
 71  | `/gdpr_blocked/i` | GDPR blocked |
 72  | `/Cloudflare.*blocked/i` | Cloudflare blocked |
 73  
 74  **Retriable:**
 75  | Pattern | Label |
 76  |---|---|
 77  | `/business hours/i` | Business hours block |
 78  | `/status.?code 429\|rate/i` | Rate limited |
 79  | `/Timeout/i` | Form timeout |
 80  | `/net::ERR_NAME_NOT_RESOLVED/i` | DNS failure |
 81  | `NULL or empty` | Unknown (no error stored) |
 82  
 83  ### Exports
 84  
 85  ```javascript
 86  export function categorizeError(errorMessage, context = 'site')
 87    // Returns { group: 'terminal'|'retriable'|'unknown', label: string }
 88  
 89  export function buildStatusTree(db)
 90    // Returns structured tree with totals, deltas, children grouped by terminal/retriable
 91  
 92  export function buildOutreachTree(db)
 93    // Returns outreach breakdown with same structure
 94  ```
 95  
 96  The `buildStatusTree` SQL for deltas:
 97  
 98  - Top-level: `COUNT(CASE WHEN updated_at > datetime('now','-24 hours') THEN 1 END)` per status
 99  - Sub-status deltas: same pattern grouped by `error_message` category — each child row gets its own `delta_24h` and `delta_1h`
100  
101  Cross-border duplicates collapsed into one row with total count.
102  
103  ---
104  
105  ## Phase 1b: Auto-Expanding Error Patterns (Daily Agent Task)
106  
107  **Problem:** New error messages will emerge that don't match any regex. They'll pile up under "Unknown" without visibility.
108  
109  **Solution:** A daily cron task that:
110  
111  1. Queries `sites` and `outreaches` for `error_message` values that fall through to `unknown` in `categorizeError()`
112  2. Groups them by frequency (top uncategorized errors, min 5 occurrences)
113  3. Makes an LLM call (Haiku) asking: "Given these error messages, suggest regex patterns and labels for each group. Format as JSON array: `[{ pattern, label, group, context }]`"
114  4. **Writes proposals to a new DB table** `error_pattern_proposals` — does NOT auto-commit to code
115  5. A human reviews via dashboard and approves/rejects — approved patterns get merged into `error-categories.js`
116  
117  **New table: `error_pattern_proposals`**
118  
119  ```sql
120  CREATE TABLE error_pattern_proposals (
121    id INTEGER PRIMARY KEY AUTOINCREMENT,
122    pattern TEXT NOT NULL,
123    label TEXT NOT NULL,
124    group_name TEXT NOT NULL CHECK(group_name IN ('terminal', 'retriable')),
125    context TEXT NOT NULL CHECK(context IN ('site', 'outreach')),
126    example_errors TEXT,        -- JSON array of sample error messages that matched
127    occurrence_count INTEGER,   -- how many errors this would have caught
128    status TEXT DEFAULT 'pending' CHECK(status IN ('pending', 'approved', 'rejected')),
129    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
130    reviewed_at DATETIME,
131    reviewed_by TEXT
132  );
133  ```
134  
135  **New file: `src/cron/classify-unknown-errors.js`**
136  
137  - Runs daily (via `cron_jobs` table entry `classifyUnknownErrors`)
138  - Uses `claude-haiku-4-5-20251001` via OpenRouter
139  - Prompt: structured JSON output only, no prose
140  - On approval: the agent task system queues a Developer task to update `error-categories.js` with the new pattern
141  
142  **Dashboard: add review panel to `dashboard/pages/1_🔧_Pipeline.py`**
143  
144  - Show pending proposals with sample errors, occurrence count, suggested pattern
145  - Approve/reject buttons (write back via `database.py`)
146  
147  **New migration:** `db/migrations/067-error-pattern-proposals.sql`
148  
149  ---
150  
151  ## Phase 2: CLI Command — `npm run status`
152  
153  **New file: `src/cli/status.js`**
154  
155  Output format (ANSI colored, box-drawing chars). **Statuses in pipeline order** (found → assets_captured → scored → rescored → enriched → proposals_drafted → outreach_sent → high_score, then ignore and failing at the end as non-pipeline buckets):
156  
157  ```
158  Pipeline Status                      Total     Δ24h     Δ1h
159  ──────────────────────────────────────────────────────────────
160  ▸ found                             26,646   +2,206      +0
161  ▸ assets_captured                      294     +257      +0
162  ▸ scored                                 0       +0      +0
163  ▸ rescored                          25,294     +352      +0
164  ▸ enriched                               0       +0      +0
165  ▸ proposals_drafted                      0       +0      +0
166  ▸ outreach_sent                      2,273       +0      +0
167  ▸ high_score                             3       +0      +0
168  ▸ ignore                           334,956   +5,148      +0
169    ├─ Cross-border duplicate        250,799      +0      +0
170    ├─ Social media                   56,252      +0      +0
171    ├─ Business directory             45,919      +0      +0
172    ├─ Duplicate domain                9,369      +0      +0
173    ├─ Not a local business            6,435      +0      +0
174    ├─ Franchise                       6,229      +0      +0
175    ├─ Non-commercial                  4,054      +0      +0
176    ├─ Government/Education            3,899      +0      +0
177    └─ Other                             799      +0      +0
178  ▸ failing                          106,843       +0      +0
179    ├─ Retriable
180    │  ├─ Browser launch conflict      7,583      +0      +0
181    │  ├─ API bad request              2,904      +0      +0
182    │  ├─ Missing screenshot             354      +0      +0
183    │  └─ ... (more)
184    └─ Terminal
185       ├─ Permanent HTTP error            68      +0      +0
186       └─ ... (more)
187  
188  Outreach Breakdown                   Total     Δ24h     Δ1h
189  ──────────────────────────────────────────────────────────────
190  ▸ pending                               45       +0      +0
191  ▸ approved (in cooldown)             2,865       +0      +0
192    ├─ form                            2,139      +0      +0
193    ├─ linkedin                          241      +0      +0
194    ├─ x                                 217      +0      +0
195    ├─ email                             136      +0      +0
196    └─ sms                               132      +0      +0
197  ▸ sent                                 215       +0      +0
198    ├─ sms                               112      +0      +0
199    └─ email                             103      +0      +0
200  ▸ delivered                              0       +0      +0
201  ▸ opened                                 0       +0      +0
202  ▸ clicked                                0       +0      +0
203  ▸ replied                                0       +0      +0
204  ▸ failed                             3,260       +0      +0
205    ├─ Retriable
206    │  ├─ Business hours block          1,807      +0      +0
207    │  ├─ Rate limited (Resend 429)    1,282      +0      +0
208    │  └─ Unknown (no error stored)      101      +0      +0
209    └─ Terminal
210       ├─ Invalid phone                    25      +0      +0
211       ├─ Cloudflare blocked                1      +0      +0
212       └─ Form timeout (terminal)          34      +0      +0
213  ▸ bounced                               12       +0      +0
214  ▸ gdpr_blocked                         347       +0      +0
215  ```
216  
217  **Status ordering:**
218  
219  - Sites: pipeline sequence first (`found → assets_captured → scored → rescored → enriched → proposals_drafted → outreach_sent → high_score`), then non-pipeline (`ignore`, `failing`) at bottom
220  - Outreaches: delivery lifecycle order (`pending → approved → sent → delivered → opened → clicked → replied → failed → bounced → gdpr_blocked`)
221  
222  **Color rules:**
223  
224  - Green delta: increases on good statuses (`enriched`, `outreach_sent`, `sent`, `delivered`, `opened`, `replied`)
225  - Red delta: increases on bad statuses (`failing`, `failed`, `bounced`)
226  - Grey: `ignore` and duplicates (expected, not actionable)
227  - Cyan: neutral mid-pipeline statuses (`found`, `rescored`, `assets_captured`, `approved`)
228  
229  Uses `src/utils/summary-generator.js` color patterns. Add to `package.json`: `"status": "node src/cli/status.js"`
230  
231  ---
232  
233  ## Phase 3: Dashboard Widget
234  
235  **Modify: `dashboard/pages/1_🔧_Pipeline.py`**
236  
237  Add a "Status Breakdown" section using `st.expander()` per status:
238  
239  ```python
240  st.subheader("Status Breakdown")
241  for row in status_tree:
242      label = f"{row['status']} — {row['total']:,}"
243      delta_parts = []
244      if row['delta_24h']: delta_parts.append(f"+{row['delta_24h']:,} 24h")
245      if row['delta_1h']: delta_parts.append(f"+{row['delta_1h']:,} 1h")
246      if delta_parts: label += f" ({', '.join(delta_parts)})"
247  
248      with st.expander(label):
249          if 'retriable' in row['children']:
250              st.markdown("**🔄 Retriable**")
251              st.dataframe(retriable_df, hide_index=True)
252              st.markdown("**⛔ Terminal**")
253              st.dataframe(terminal_df, hide_index=True)
254          else:
255              st.dataframe(children_df, hide_index=True)
256  ```
257  
258  Add an "Outreach Breakdown" section similarly below, with channel sub-breakdown and failed reason tree.
259  
260  **Modify: `dashboard/utils/database.py`** — add `get_status_tree()` and `get_outreach_tree()` functions that read from `dashboard_cache` key `status_tree` / `outreach_tree`.
261  
262  ---
263  
264  ## Phase 4: Precalc Integration
265  
266  **Modify: `src/cron/precompute-dashboard.js`**
267  
268  Add `cacheStatusTree(db)` function that:
269  
270  1. Calls `buildStatusTree(db)` and `buildOutreachTree(db)` from `error-categories.js`
271  2. Stores results in `dashboard_cache` as keys `status_tree` and `outreach_tree`
272  3. Called during the existing `precomputeAll()` function alongside other cache functions
273  
274  The CLI (`npm run status`) runs queries directly (one-shot, no caching needed).
275  
276  ---
277  
278  ## Phase 5: Bug Fixes
279  
280  ### Fix 1: SMS Business Hours Bug (`src/outreach/sms.js`)
281  
282  **Problem:** Line 123 throws `new Error('SMS blocked: outside business hours')`, caught at line 187 which sets `status='failed'` without `error_message`. 1,908 outreaches permanently lost.
283  
284  **Fix:** For business hours blocks, return early instead of throwing:
285  
286  ```javascript
287  // Line 121-123: Replace throw with return
288  // Business hours: don't throw, return so status stays 'approved' for next attempt
289  return { success: false, outreachId, reason: 'business_hours', skipped: true };
290  ```
291  
292  Also fix the catch block at line 187-191 to persist `error_message`:
293  
294  ```javascript
295  db.prepare(`UPDATE outreaches SET status = 'failed', error_message = ? WHERE id = ?`).run(
296    error.message,
297    outreachId
298  );
299  ```
300  
301  **Data fix:** Reset the 1,908 broken outreaches back to `approved`:
302  
303  ```sql
304  UPDATE outreaches SET status = 'approved', error_message = NULL
305  WHERE contact_method='sms' AND status='failed' AND (error_message IS NULL OR error_message='');
306  ```
307  
308  ### Fix 2: Assets Timeout (`src/stages/assets.js`)
309  
310  **Problem:** No per-site timeout. Hung Chromium freezes the entire browser loop for hours.
311  
312  **Fix:** Wrap `captureSiteScreenshots` with `withTimeout()` (already exported from `src/utils/error-handler.js`):
313  
314  ```javascript
315  return withTimeout(
316    captureSiteScreenshots(site.id, site.url, db, site.country_code),
317    120000,
318    `Capture timed out after 120s for ${site.url}`
319  );
320  ```
321  
322  Apply at line 128 (HTML-only path) and line 299 (vision path) inside the `processBatch` callback.
323  
324  ### Fix 3: EACCES + Browser Conflict Sites Reset
325  
326  Reset both categories (79 + 7,583 = 7,662 sites):
327  
328  ```sql
329  -- 79 EACCES sites (unmounted SSD)
330  UPDATE sites SET status = 'found', retry_count = 0, error_message = NULL
331  WHERE status = 'failing' AND error_message LIKE '%EACCES%';
332  
333  -- 7,583 browser launch conflict sites (Playwright --user-data-dir error)
334  UPDATE sites SET status = 'found', retry_count = 0, error_message = NULL
335  WHERE status = 'failing' AND error_message LIKE '%userDataDir%';
336  ```
337  
338  ### Fix 4: Outreach Reputation Guard (`src/utils/outreach-guard.js`)
339  
340  **Problem:** No protection against burning Twilio/Resend reputation by sending thousands of messages that all fail with the same error. The 1,908 SMS and 1,282 email failures happened in rapid succession.
341  
342  **Solution:** A lightweight error dedup tracker (not opossum — this sits _above_ the circuit breaker):
343  
344  ```javascript
345  // In-memory sliding window: tracks error messages per channel in last 2h
346  const errorCounts = new Map();  // key: "sms:error message" → [{timestamp}, ...]
347  const WINDOW_MS = 2 * 60 * 60 * 1000;  // 2 hours
348  const THRESHOLD = 25;  // same error 25 times → halt channel
349  
350  export function recordOutreachError(channel, errorMessage) { ... }
351  export function shouldHaltChannel(channel) { ... }
352  export function getHaltedChannels() { ... }
353  export function clearHalt(channel) { ... }
354  ```
355  
356  **Integration into `src/outreach/sms.js` and `src/outreach/email.js`:**
357  
358  - Before sending: `if (shouldHaltChannel('sms')) { return { skipped: true, reason: 'channel_halted' }; }`
359  - On failure: `recordOutreachError('sms', error.message)`
360  - The existing opossum circuit breakers handle rate limits (429/503). This guard handles _application-level_ errors like bad phone formats, auth failures, or unexpected API responses that repeat.
361  
362  **Integration into `src/stages/outreach.js`:**
363  
364  - At start of each channel batch: check `shouldHaltChannel(method)` and skip with warning log
365  - Log when halted: `"⛔ SMS channel halted: 25× 'outside business hours' errors in 2h window"`
366  
367  **Persistence:** In-memory only (resets on pipeline restart). This is intentional — after a restart, the underlying issue is presumably fixed or the window has passed.
368  
369  ---
370  
371  ## Files Summary
372  
373  | File                                            | Action     | Purpose                                                                   |
374  | ----------------------------------------------- | ---------- | ------------------------------------------------------------------------- |
375  | `src/utils/error-categories.js`                 | **Create** | Regex patterns, categorizeError(), buildStatusTree(), buildOutreachTree() |
376  | `src/cli/status.js`                             | **Create** | CLI entry point for `npm run status`                                      |
377  | `package.json`                                  | **Modify** | Add `"status"` script                                                     |
378  | `src/cron/precompute-dashboard.js`              | **Modify** | Add cacheStatusTree() to precalc                                          |
379  | `src/cron/classify-unknown-errors.js`           | **Create** | Daily Haiku LLM task to propose regex patterns for unknown errors         |
380  | `db/migrations/067-error-pattern-proposals.sql` | **Create** | New table for LLM-proposed error patterns awaiting review                 |
381  | `db/schema.sql`                                 | **Modify** | Add error_pattern_proposals table                                         |
382  | `dashboard/pages/1_🔧_Pipeline.py`              | **Modify** | Add expandable status breakdown + error pattern review panel              |
383  | `dashboard/utils/database.py`                   | **Modify** | Add get_status_tree(), get_outreach_tree(), get_error_proposals()         |
384  | `src/outreach/sms.js`                           | **Modify** | Fix business hours throw + error_message persistence                      |
385  | `src/stages/assets.js`                          | **Modify** | Add withTimeout() on captureSiteScreenshots                               |
386  | `src/utils/outreach-guard.js`                   | **Create** | Error dedup tracker — halt channel after 25 same errors in 2h             |
387  | `src/stages/outreach.js`                        | **Modify** | Check shouldHaltChannel() before sending                                  |
388  | `tests/utils/error-categories.test.js`          | **Create** | Unit tests for categorization                                             |
389  | `tests/outreach/sms-business-hours.test.js`     | **Create** | Test SMS business hours fix                                               |
390  | `tests/utils/outreach-guard.test.js`            | **Create** | Test error dedup halting logic                                            |
391  | `tests/cron/classify-unknown-errors.test.js`    | **Create** | Test unknown error classification cron                                    |
392  
393  ---
394  
395  ## Verification
396  
397  1. `npm run status` — verify tree output in pipeline order, sub-status deltas visible, colors render
398  2. `npm run dashboard` — check Pipeline page has expandable status breakdown with sub-deltas
399  3. `npm test` — all tests pass
400  4. `node -e "import('./src/utils/error-categories.js').then(m => console.log(m.categorizeError('EACCES: permission denied')))"` — verify `{ group: 'retriable', label: 'Permission denied' }`
401  5. Verify 7,662 sites (79 EACCES + 7,583 browser conflict) reset to `found`
402  6. Verify 1,908 SMS outreaches reset to `approved`
403  7. Restart pipeline, confirm Assets captures timeout after 120s (not hang forever)
404  8. Test outreach guard: simulate 25 same-error SMS failures, verify channel halts
405  9. Verify `npm run status` shows outreach guard status (halted channels, if any)
406  10. Run `node src/cron/classify-unknown-errors.js` manually, verify proposals appear in `error_pattern_proposals` table
407  11. Approve a proposal in dashboard, verify it queues a Developer agent task to update `error-categories.js`