---
title: Agent Circuit Breaker Management
category: 06-automation
last_verified: 2026-02-16
related_files:
  - scripts/reset-agent-circuit-breakers.js
  - src/agents/runner.js
  - src/utils/circuit-breaker.js
tags: [agents, circuit-breaker, monitoring, recovery]
status: active
---

# Agent Circuit Breaker Management

## Overview

The agent system uses circuit breakers to stop cascading failures and to limit API costs when tasks fail repeatedly. This document covers circuit breaker behavior, auto-recovery, and manual reset procedures.

## Circuit Breaker States

### Agent Circuit Breakers

**States:**

- `idle` - Normal operation, agent can process tasks
- `blocked` - Circuit breaker triggered, agent cannot process tasks
- `working` - Agent actively processing a task

**Trigger Conditions:**

The circuit breaker trips when both conditions hold:

- Failure rate > 30% (configurable via `AGENT_CIRCUIT_BREAKER_THRESHOLD`)
- Minimum 10 tasks in 24-hour window

The trigger timestamp is stored in `agent_state.metrics_json.circuit_breaker_triggered_at`.
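
The trigger rule above can be sketched as a small predicate. This is an illustrative sketch only: `shouldTripBreaker` and the `{ failed, total }` input shape are hypothetical names, not the actual code in `src/agents/runner.js`.

```javascript
// Hypothetical helper: the function name and input shape are assumptions.
const DEFAULT_THRESHOLD = 0.3; // 30% failure rate
const MIN_TASKS_24H = 10;      // minimum sample size in the 24-hour window

// `failed` and `total` are task counts for the last 24 hours.
function shouldTripBreaker({ failed, total }, threshold = DEFAULT_THRESHOLD) {
  // Too few tasks: the failure rate is not meaningful yet.
  if (total < MIN_TASKS_24H) return false;
  // Strictly greater than the threshold, matching "Failure rate > 30%".
  return failed / total > threshold;
}

console.log(shouldTripBreaker({ failed: 2, total: 5 }));  // false (fewer than 10 tasks)
console.log(shouldTripBreaker({ failed: 3, total: 10 })); // false (exactly 30%, not above)
console.log(shouldTripBreaker({ failed: 4, total: 10 })); // true  (40% > 30%)
```

The minimum task count keeps one or two early failures from blocking an agent that has barely run.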

**Auto-Recovery:**

- Cooldown: 30 minutes (configurable via `AGENT_CIRCUIT_BREAKER_COOLDOWN`)
- Conditions:
  1. Cooldown period expired
  2. Failure rate dropped below threshold
- Recovery logged to `agent_logs` table
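
A minimal sketch of the auto-recovery check, assuming both conditions must hold and that `failure_rate` is stored as a fraction (`0.357`, not `35.7`). `canAutoRecover` is a hypothetical name; the real check may differ.

```javascript
// Hypothetical helper. Assumes `metrics` mirrors agent_state.metrics_json and
// that failure_rate is a fraction between 0 and 1.
function canAutoRecover(metrics, now = new Date(), options = {}) {
  const { cooldownMinutes = 30, threshold = 0.3 } = options;
  const triggeredAt = new Date(metrics.circuit_breaker_triggered_at);
  const ageMinutes = (now - triggeredAt) / 60000;

  const cooldownExpired = ageMinutes >= cooldownMinutes;
  const failureRateOk = metrics.failure_rate < threshold;
  return cooldownExpired && failureRateOk;
}

const metrics = {
  circuit_breaker_triggered_at: '2026-02-16T00:00:00.000Z',
  failure_rate: 0.2,
};
console.log(canAutoRecover(metrics, new Date('2026-02-16T00:15:00.000Z'))); // false (cooldown still running)
console.log(canAutoRecover(metrics, new Date('2026-02-16T00:31:00.000Z'))); // true
```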

### API Circuit Breakers

**Separate circuit breakers for external APIs:**

- OpenRouter (AI scoring) - 120s timeout, 2min cooldown
- ZenRows (SERP scraping) - 180s timeout, 2min cooldown
- Twilio (SMS) - 30s timeout, 1min cooldown
- Resend (Email) - 30s timeout, 1min cooldown

See `src/utils/circuit-breaker.js` for implementation.
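
To illustrate how a per-API timeout and cooldown can work together, here is a minimal sketch. It is not the code in `src/utils/circuit-breaker.js`: the `SimpleBreaker` class, the `API_BREAKERS` map, and the rule for opening the circuit (`maxFailures` consecutive failures) are assumptions made for the example. Only the timeout and cooldown values come from the list above.

```javascript
// Timeout and cooldown values from the list above, in milliseconds.
const API_BREAKERS = {
  openrouter: { timeoutMs: 120_000, cooldownMs: 120_000 },
  zenrows:    { timeoutMs: 180_000, cooldownMs: 120_000 },
  twilio:     { timeoutMs: 30_000,  cooldownMs: 60_000 },
  resend:     { timeoutMs: 30_000,  cooldownMs: 60_000 },
};

// Hypothetical breaker: opens after `maxFailures` consecutive failures
// (an assumed rule) and rejects calls until the cooldown has elapsed.
class SimpleBreaker {
  constructor({ timeoutMs, cooldownMs, maxFailures = 3 }) {
    this.timeoutMs = timeoutMs;
    this.cooldownMs = cooldownMs;
    this.maxFailures = maxFailures;
    this.failures = 0;
    this.openedAt = null;
  }

  isOpen(now = Date.now()) {
    return this.openedAt !== null && now - this.openedAt < this.cooldownMs;
  }

  async call(fn) {
    if (this.isOpen()) throw new Error('circuit open');
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), this.timeoutMs);
    });
    try {
      const result = await Promise.race([fn(), timeout]);
      this.failures = 0; // a success closes the circuit again
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    } finally {
      clearTimeout(timer); // avoid keeping the process alive
    }
  }
}
```

A caller would create one breaker per API, for example `new SimpleBreaker(API_BREAKERS.twilio)`, and wrap each outbound request in `breaker.call(() => sendSms(...))`, where `sendSms` stands in for the real client call.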

## Smart Error Classification

**Agent errors (do NOT trigger circuit breaker):**

- Unknown task types
- Task validation failures
- Features not yet implemented
- Invalid task context

These errors are routed to the Architect agent for system improvements.

**Business logic errors (DO trigger circuit breaker):**

- Database errors (UNIQUE constraint, connection errors)
- Network timeouts (ETIMEDOUT, ECONNREFUSED)
- API errors (HTTP 500, 502, 503)
- Runtime errors (null or undefined references)

See `src/agents/triage.js` for classification logic.
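
One way such a classifier could look, as a sketch only. The patterns, the `classifyError` name, and the return shape are hypothetical; the real rules live in `src/agents/triage.js`. The sketch also assumes that anything not recognized as an agent error counts toward the circuit breaker.

```javascript
// Hypothetical patterns for agent errors; the real list may differ.
const AGENT_ERROR_PATTERNS = [
  /unknown task type/i,
  /validation failed/i,
  /not implemented/i,
  /invalid task context/i,
];

function classifyError(message) {
  const isAgentError = AGENT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
  return isAgentError
    ? { kind: 'agent', countsTowardBreaker: false, routeTo: 'architect' }
    : { kind: 'business', countsTowardBreaker: true, routeTo: null };
}

console.log(classifyError('Unknown task type: foo').kind);          // agent
console.log(classifyError('connect ETIMEDOUT 10.0.0.1:443').kind);  // business
```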

## Manual Circuit Breaker Reset

### Script: `reset-agent-circuit-breakers.js`

**Location:** `scripts/reset-agent-circuit-breakers.js`

**Usage:**

```bash
# Dry run (show what would be reset)
node scripts/reset-agent-circuit-breakers.js --dry-run

# Reset circuit breakers older than 30 minutes
node scripts/reset-agent-circuit-breakers.js

# Force reset all circuit breakers immediately
node scripts/reset-agent-circuit-breakers.js --force

# Reset circuit breakers AND cleanup old failed tasks
node scripts/reset-agent-circuit-breakers.js --cleanup-tasks

# Dry run including task cleanup
node scripts/reset-agent-circuit-breakers.js --dry-run --cleanup-tasks
```

**Options:**

- `--dry-run` - Show what would be reset without making changes
- `--cleanup-tasks` - Mark failed tasks older than 24 hours as cancelled
- `--force` - Reset all circuit breakers regardless of cooldown period

**Output:**

```
Agent Circuit Breaker Reset Tool
=================================

Found 3 blocked agents:

  developer:
    - Triggered: 2026-02-15T09:22:15.000Z
    - Age: 1439 minutes
    - Failure rate: 35.7%
    - Action: RESET (cooldown expired)

  architect:
    - Triggered: 2026-02-16T01:59:47.000Z
    - Age: 15 minutes
    - Failure rate: 44.0%
    - Action: SKIP (cooldown expires in 15 minutes)

  monitor:
    - Triggered: 2026-02-15T09:22:15.000Z
    - Age: 1439 minutes
    - Failure rate: 35.3%
    - Action: RESET (cooldown expired)

Reset 2/3 circuit breakers

Current Agent Status:
=====================

  developer: ✅ ACTIVE
    - Success rate: 15.4%
    - Failure rate: 35.7%
    - Total tasks (24h): 26

  architect: 🔴 BLOCKED
    - Success rate: 41.7%
    - Failure rate: 44.0%
    - Total tasks (24h): 24

  monitor: ✅ ACTIVE
    - Success rate: 47.1%
    - Failure rate: 35.3%
    - Total tasks (24h): 17
```
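
The `Age` and `Action` lines in the output above could be derived roughly as follows. This is a sketch, not the script's actual logic: `decideAction` is a hypothetical name, the `RESET (forced)` label is invented for the `--force` case, and the sketch assumes the manual reset checks only the cooldown, not the failure rate.

```javascript
// Hypothetical helper; the 30-minute default mirrors AGENT_CIRCUIT_BREAKER_COOLDOWN.
function decideAction(triggeredAt, now, { cooldownMinutes = 30, force = false } = {}) {
  const ageMinutes = Math.floor((now - new Date(triggeredAt)) / 60000);
  if (force) {
    return { ageMinutes, action: 'RESET (forced)' }; // label is an assumption
  }
  if (ageMinutes >= cooldownMinutes) {
    return { ageMinutes, action: 'RESET (cooldown expired)' };
  }
  const remaining = cooldownMinutes - ageMinutes;
  return { ageMinutes, action: `SKIP (cooldown expires in ${remaining} minutes)` };
}

const result = decideAction('2026-02-15T09:22:15.000Z', new Date('2026-02-16T09:21:30.000Z'));
console.log(result.ageMinutes); // 1439
console.log(result.action);     // RESET (cooldown expired)
```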

## Checking Circuit Breaker Status

### SQL Queries

**Check blocked agents:**

```sql
SELECT
  agent_name,
  status,
  json_extract(metrics_json, '$.circuit_breaker_triggered_at') as triggered_at,
  json_extract(metrics_json, '$.failure_rate') as failure_rate,
  json_extract(metrics_json, '$.total_tasks_24h') as total_tasks
FROM agent_state
WHERE status = 'blocked';
```

**Check agent metrics:**

```sql
SELECT
  agent_name,
  status,
  json_extract(metrics_json, '$.success_rate') as success_rate,
  json_extract(metrics_json, '$.failure_rate') as failure_rate,
  json_extract(metrics_json, '$.total_tasks_24h') as total_tasks_24h,
  json_extract(metrics_json, '$.last_health_check') as last_health_check
FROM agent_state
ORDER BY agent_name;
```

**Check recent agent failures:**

```sql
SELECT
  id,
  task_type,
  assigned_to,
  error_message,
  created_at,
  julianday('now') - julianday(created_at) as age_days
FROM agent_tasks
WHERE status = 'failed'
  AND created_at > datetime('now', '-24 hours')
ORDER BY created_at DESC
LIMIT 20;
```

### CLI Commands

**Check agent status:**

```bash
npm run agent:list
```

**View failed tasks:**

```bash
npm run agent:tasks -- --status failed
```

## Task Cleanup

### Old Failed Tasks

**Why cleanup:**

- Reduces noise in task queue
- Improves query performance
- Prevents stale task accumulation

**Cleanup criteria:**

- Status: `failed`
- Age: > 24 hours
- Action: Mark as `cancelled` with timestamp
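
The cleanup criteria above, sketched as an in-memory transformation. `cancelStaleFailedTasks` is a hypothetical helper, and the sketch assumes `created_at` is an ISO 8601 UTC string; timestamps stored in another format would need parsing first.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: returns a new array in which failed tasks older than
// 24 hours are marked as cancelled. Other tasks are returned unchanged.
function cancelStaleFailedTasks(tasks, now = new Date()) {
  return tasks.map((task) => {
    const ageMs = now - new Date(task.created_at);
    const isStale = task.status === 'failed' && ageMs > DAY_MS;
    if (!isStale) return task;
    return {
      ...task,
      status: 'cancelled',
      error_message: `${task.error_message ?? ''} [Auto-cancelled after 24h]`,
      completed_at: now.toISOString(),
    };
  });
}

const now = new Date('2026-02-16T12:00:00.000Z');
const tasks = [
  { id: 1, status: 'failed', created_at: '2026-02-15T11:00:00.000Z', error_message: 'ETIMEDOUT' },
  { id: 2, status: 'failed', created_at: '2026-02-16T06:00:00.000Z', error_message: 'ETIMEDOUT' },
];
console.log(cancelStaleFailedTasks(tasks, now).map((t) => t.status)); // [ 'cancelled', 'failed' ]
```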

**Manual cleanup:**

```bash
node scripts/reset-agent-circuit-breakers.js --cleanup-tasks
```

**SQL cleanup:**

```sql
-- Show old failed tasks
SELECT
  id, task_type, assigned_to, error_message,
  julianday('now') - julianday(created_at) as age_days
FROM agent_tasks
WHERE status = 'failed'
  AND created_at < datetime('now', '-24 hours')
ORDER BY created_at ASC;

-- Mark as cancelled
UPDATE agent_tasks
SET status = 'cancelled',
    error_message = COALESCE(error_message, '') || ' [Auto-cancelled after 24h]',
    completed_at = datetime('now')
WHERE status = 'failed'
  AND created_at < datetime('now', '-24 hours');
```

## Environment Variables

**Agent Circuit Breaker:**

- `AGENT_CIRCUIT_BREAKER_THRESHOLD` - Failure rate threshold (default: `0.3` = 30%)
- `AGENT_CIRCUIT_BREAKER_COOLDOWN` - Cooldown period in minutes (default: `30`)
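
A sketch of reading these two variables with their documented defaults. `loadBreakerConfig` and its validation rules are assumptions for illustration; the runner may parse them differently.

```javascript
// Hypothetical loader: falls back to the documented defaults and rejects
// values that cannot be a valid threshold or cooldown.
function loadBreakerConfig(env = process.env) {
  const threshold = Number(env.AGENT_CIRCUIT_BREAKER_THRESHOLD ?? 0.3);
  const cooldownMinutes = Number(env.AGENT_CIRCUIT_BREAKER_COOLDOWN ?? 30);

  // The threshold is a fraction (0.3 = 30%), so it must lie in (0, 1].
  if (!(threshold > 0 && threshold <= 1)) {
    throw new RangeError('AGENT_CIRCUIT_BREAKER_THRESHOLD must be in (0, 1]');
  }
  if (!(cooldownMinutes >= 0)) {
    throw new RangeError('AGENT_CIRCUIT_BREAKER_COOLDOWN must be a non-negative number');
  }
  return { threshold, cooldownMinutes };
}

console.log(loadBreakerConfig({}));
// { threshold: 0.3, cooldownMinutes: 30 }
console.log(loadBreakerConfig({ AGENT_CIRCUIT_BREAKER_THRESHOLD: '0.5', AGENT_CIRCUIT_BREAKER_COOLDOWN: '45' }));
// { threshold: 0.5, cooldownMinutes: 45 }
```

Validating up front catches a common mistake: setting the threshold to `30` instead of `0.3`.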

**API Circuit Breakers:**

- Set via `src/utils/circuit-breaker.js` (not configurable via env)

## Troubleshooting

### Agent stuck in blocked state

**Symptoms:**

- Agent status = `blocked`
- Circuit breaker triggered hours/days ago
- Failure rate dropped but still blocked

**Solution:**

```bash
# Check if auto-recovery should have happened
node scripts/reset-agent-circuit-breakers.js --dry-run

# Force reset if needed
node scripts/reset-agent-circuit-breakers.js --force
```

### High failure rate persists

**Symptoms:**

- Agent resets but immediately triggers again
- Failure rate > 30% consistently

**Investigation:**

```sql
-- Check recent failures (replace 'developer' with the affected agent)
SELECT error_message, COUNT(*) as count
FROM agent_tasks
WHERE status = 'failed'
  AND assigned_to = 'developer'
  AND created_at > datetime('now', '-24 hours')
GROUP BY error_message
ORDER BY count DESC;
```

**Actions:**

1. Review error patterns in failed tasks
2. Check whether the errors are agent errors, which should be routed to the Architect agent rather than count toward the circuit breaker
3. Verify error classification in `src/agents/triage.js`
4. Fix underlying issues before resetting circuit breaker

### Circuit breaker logs

**View recovery logs:**

```sql
SELECT
  agent_name,
  message,
  metadata_json,
  created_at
FROM agent_logs
WHERE message LIKE '%circuit breaker%'
ORDER BY created_at DESC
LIMIT 20;
```

## Pre-Activation Checklist

**Before activating agent system (enabling cron):**

1. **Reset circuit breakers:**

   ```bash
   node scripts/reset-agent-circuit-breakers.js --cleanup-tasks
   ```

2. **Verify agent status:**

   ```bash
   npm run agent:list
   ```

3. **Check task queue:**

   ```bash
   npm run agent:tasks
   ```

4. **Review recent failures:**

   ```sql
   SELECT assigned_to, COUNT(*) as count
   FROM agent_tasks
   WHERE status = 'failed'
     AND created_at > datetime('now', '-24 hours')
   GROUP BY assigned_to;
   ```

5. **Enable cron if all clear:**

   ```bash
   # Set AGENT_SYSTEM_ENABLED=true in .env
   systemctl restart 333method-agent
   ```

## Best Practices

1. **Monitor agent health:**
   - Check `npm run agent:list` daily
   - Review failure rates weekly
   - Investigate spikes in failures immediately

2. **Auto-recovery first:**
   - Let circuit breakers auto-recover when possible
   - Only force reset if truly necessary
   - Document reason for force resets

3. **Cleanup regularly:**
   - Run `--cleanup-tasks` weekly
   - Keep failed task history under 1000 records
   - Archive old `agent_logs` monthly

4. **Error classification:**
   - Review `src/agents/triage.js` quarterly
   - Add new agent error patterns as needed
   - Ensure business logic errors are counted correctly

5. **Testing:**
   - Use `--dry-run` before production resets
   - Test circuit breaker behavior in staging
   - Verify auto-recovery works as expected

## See Also

- [Agent System Documentation](agent-system.md)
- [Cron System Documentation](cron-system.md)
- [Circuit Breaker Implementation](../../src/utils/circuit-breaker.js)
- [Agent Runner](../../src/agents/runner.js)