circuit-breaker-management.md
---
title: Agent Circuit Breaker Management
category: 06-automation
last_verified: 2026-02-16
related_files:
  - scripts/reset-agent-circuit-breakers.js
  - src/agents/runner.js
  - src/utils/circuit-breaker.js
tags: [agents, circuit-breaker, monitoring, recovery]
status: active
---

# Agent Circuit Breaker Management

## Overview

The agent system uses circuit breakers to prevent cascading failures and excessive API costs during repeated failures. This document covers circuit breaker behavior, auto-recovery, and manual reset procedures.

## Circuit Breaker States

### Agent Circuit Breakers

**States:**

- `idle` - Normal operation, agent can process tasks
- `blocked` - Circuit breaker triggered, agent cannot process tasks
- `working` - Agent actively processing a task

**Trigger Conditions:**

- Failure rate > 30% (configurable via `AGENT_CIRCUIT_BREAKER_THRESHOLD`)
- Minimum 10 tasks in 24-hour window
- Circuit breaker timestamp stored in `agent_state.metrics_json.circuit_breaker_triggered_at`

**Auto-Recovery:**

- Cooldown: 30 minutes (configurable via `AGENT_CIRCUIT_BREAKER_COOLDOWN`)
- Conditions:
  1. Cooldown period expired
  2. Failure rate dropped below threshold
- Recovery logged to `agent_logs` table

### API Circuit Breakers

**Separate circuit breakers for external APIs:**

- OpenRouter (AI scoring) - 120s timeout, 2min cooldown
- ZenRows (SERP scraping) - 180s timeout, 2min cooldown
- Twilio (SMS) - 30s timeout, 1min cooldown
- Resend (Email) - 30s timeout, 1min cooldown

See `src/utils/circuit-breaker.js` for implementation.
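
**Illustrative sketch of the agent trigger and auto-recovery rules:**

The trigger conditions and auto-recovery conditions listed under Agent Circuit Breakers can be expressed as two small predicates. This is a minimal sketch, not the code in `src/agents/runner.js`: the function names `shouldTrip` and `canAutoRecover`, the config object shape, and the `minTasks` field are assumptions made for illustration. Only the environment variable names and their defaults come from this document.

```javascript
// Hypothetical sketch of the rules described above; names are illustrative only.
const defaultConfig = {
  // Failure rate threshold (documented default: 0.3 = 30%)
  threshold: Number(process.env.AGENT_CIRCUIT_BREAKER_THRESHOLD ?? 0.3),
  // Cooldown in minutes (documented default: 30)
  cooldownMinutes: Number(process.env.AGENT_CIRCUIT_BREAKER_COOLDOWN ?? 30),
  // Minimum tasks in the 24-hour window before the breaker may trip
  minTasks: 10,
};

// Trip only when enough tasks were seen AND the failure rate exceeds the threshold.
function shouldTrip({ failed, total }, config = defaultConfig) {
  if (total < config.minTasks) return false;
  return failed / total > config.threshold;
}

// Auto-recover only when the cooldown has expired AND the failure rate
// has dropped below the threshold.
function canAutoRecover({ triggeredAt, failed, total }, now = new Date(), config = defaultConfig) {
  const ageMinutes = (now.getTime() - new Date(triggeredAt).getTime()) / 60000;
  const failureRate = total === 0 ? 0 : failed / total;
  return ageMinutes >= config.cooldownMinutes && failureRate < config.threshold;
}
```

Under these assumptions, an agent with 4 failures out of 9 tasks would not trip the breaker (too few tasks), while 4 out of 10 would. A blocked agent whose breaker tripped 15 minutes ago stays blocked regardless of its current failure rate.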

## Smart Error Classification

**Agent errors (do NOT trigger circuit breaker):**

- Unknown task types
- Task validation failures
- Not implemented features
- Invalid task context
- Routed to Architect agent for system improvements

**Business logic errors (DO trigger circuit breaker):**

- Database errors (UNIQUE constraint, connection errors)
- Network timeouts (ETIMEDOUT, ECONNREFUSED)
- API errors (500, 502, 503)
- Runtime errors (null pointer, undefined)

See `src/agents/triage.js` for classification logic.

## Manual Circuit Breaker Reset

### Script: `reset-agent-circuit-breakers.js`

**Location:** `scripts/reset-agent-circuit-breakers.js`

**Usage:**

```bash
# Dry run (show what would be reset)
node scripts/reset-agent-circuit-breakers.js --dry-run

# Reset circuit breakers older than 30 minutes
node scripts/reset-agent-circuit-breakers.js

# Force reset all circuit breakers immediately
node scripts/reset-agent-circuit-breakers.js --force

# Reset circuit breakers AND cleanup old failed tasks
node scripts/reset-agent-circuit-breakers.js --cleanup-tasks

# Dry run with all options
node scripts/reset-agent-circuit-breakers.js --dry-run --cleanup-tasks
```

**Options:**

- `--dry-run` - Show what would be reset without making changes
- `--cleanup-tasks` - Mark failed tasks older than 24 hours as cancelled
- `--force` - Reset all circuit breakers regardless of cooldown period

**Output:**

```
Agent Circuit Breaker Reset Tool
=================================

Found 3 blocked agents:

developer:
  - Triggered: 2026-02-15T09:22:15.000Z
  - Age: 1439 minutes
  - Failure rate: 35.7%
  - Action: RESET (cooldown expired)

architect:
  - Triggered: 2026-02-16T01:59:47.000Z
  - Age: 15 minutes
  - Failure rate: 44.0%
  - Action: SKIP (cooldown expires in 15 minutes)

monitor:
  - Triggered: 2026-02-15T09:22:15.000Z
  - Age: 1439 minutes
  - Failure rate: 35.3%
  - Action: RESET (cooldown expired)

Reset 2/3 circuit breakers

Current Agent Status:
=====================

developer: ✅ ACTIVE
  - Success rate: 15.4%
  - Failure rate: 35.7%
  - Total tasks (24h): 26

architect: 🔴 BLOCKED
  - Success rate: 41.7%
  - Failure rate: 44.0%
  - Total tasks (24h): 24

monitor: ✅ ACTIVE
  - Success rate: 47.1%
  - Failure rate: 35.3%
  - Total tasks (24h): 17
```

## Checking Circuit Breaker Status

### SQL Queries

**Check blocked agents:**

```sql
SELECT
  agent_name,
  status,
  json_extract(metrics_json, '$.circuit_breaker_triggered_at') as triggered_at,
  json_extract(metrics_json, '$.failure_rate') as failure_rate,
  json_extract(metrics_json, '$.total_tasks_24h') as total_tasks
FROM agent_state
WHERE status = 'blocked';
```

**Check agent metrics:**

```sql
SELECT
  agent_name,
  status,
  json_extract(metrics_json, '$.success_rate') as success_rate,
  json_extract(metrics_json, '$.failure_rate') as failure_rate,
  json_extract(metrics_json, '$.total_tasks_24h') as total_tasks_24h,
  json_extract(metrics_json, '$.last_health_check') as last_health_check
FROM agent_state
ORDER BY agent_name;
```

**Check recent agent failures:**

```sql
SELECT
  id,
  task_type,
  assigned_to,
  error_message,
  created_at,
  julianday('now') - julianday(created_at) as age_days
FROM agent_tasks
WHERE status = 'failed'
  AND created_at > datetime('now', '-24 hours')
ORDER BY created_at DESC
LIMIT 20;
```

### CLI Commands

**Check agent status:**

```bash
npm run agent:list
```

**View failed tasks:**

```bash
npm run agent:tasks -- --status failed
```

## Task Cleanup

### Old Failed Tasks

**Why cleanup:**

- Reduces noise in task queue
- Improves query performance
- Prevents stale task accumulation

**Cleanup criteria:**

- Status: `failed`
- Age: > 24 hours
- Action: Mark as `cancelled` with timestamp

**Manual cleanup:**

```bash
node scripts/reset-agent-circuit-breakers.js --cleanup-tasks
```

**SQL cleanup:**

```sql
-- Show old failed tasks
SELECT
  id, task_type, assigned_to, error_message,
  julianday('now') - julianday(created_at) as age_days
FROM agent_tasks
WHERE status = 'failed'
  AND created_at < datetime('now', '-24 hours')
ORDER BY created_at ASC;

-- Mark as cancelled
UPDATE agent_tasks
SET status = 'cancelled',
    error_message = COALESCE(error_message, '') || ' [Auto-cancelled after 24h]',
    completed_at = datetime('now')
WHERE status = 'failed'
  AND created_at < datetime('now', '-24 hours');
```

## Environment Variables

**Agent Circuit Breaker:**

- `AGENT_CIRCUIT_BREAKER_THRESHOLD` - Failure rate threshold (default: `0.3` = 30%)
- `AGENT_CIRCUIT_BREAKER_COOLDOWN` - Cooldown period in minutes (default: `30`)

**API Circuit Breakers:**

- Set via `src/utils/circuit-breaker.js` (not configurable via env)

## Troubleshooting

### Agent stuck in blocked state

**Symptoms:**

- Agent status = `blocked`
- Circuit breaker triggered hours/days ago
- Failure rate dropped but still blocked

**Solution:**

```bash
# Check if auto-recovery should have happened
node scripts/reset-agent-circuit-breakers.js --dry-run

# Force reset if needed
node scripts/reset-agent-circuit-breakers.js --force
```

### High failure rate persists

**Symptoms:**

- Agent resets but immediately triggers again
- Failure rate > 30% consistently

**Investigation:**

```sql
-- Check recent failures
SELECT error_message, COUNT(*) as count
FROM agent_tasks
WHERE status = 'failed'
  AND assigned_to = 'developer'
  AND created_at > datetime('now', '-24 hours')
GROUP BY error_message
ORDER BY count DESC;
```

**Actions:**

1. Review error patterns in failed tasks
2. Check if errors are agent system errors (should go to Architect)
3. Verify error classification in `src/agents/triage.js`
4. Fix underlying issues before resetting circuit breaker

### Circuit breaker logs

**View recovery logs:**

```sql
SELECT
  agent_name,
  message,
  metadata_json,
  created_at
FROM agent_logs
WHERE message LIKE '%circuit breaker%'
ORDER BY created_at DESC
LIMIT 20;
```

## Pre-Activation Checklist

**Before activating agent system (enabling cron):**

1. **Reset circuit breakers:**

   ```bash
   node scripts/reset-agent-circuit-breakers.js --cleanup-tasks
   ```

2. **Verify agent status:**

   ```bash
   npm run agent:list
   ```

3. **Check task queue:**

   ```bash
   npm run agent:tasks
   ```

4. **Review recent failures:**

   ```sql
   SELECT assigned_to, COUNT(*) as count
   FROM agent_tasks
   WHERE status = 'failed'
     AND created_at > datetime('now', '-24 hours')
   GROUP BY assigned_to;
   ```

5. **Enable cron if all clear:**
   ```bash
   # Set AGENT_SYSTEM_ENABLED=true in .env
   systemctl restart 333method-agent
   ```

## Best Practices

1. **Monitor agent health:**
   - Check `npm run agent:list` daily
   - Review failure rates weekly
   - Investigate spikes in failures immediately

2. **Auto-recovery first:**
   - Let circuit breakers auto-recover when possible
   - Only force reset if truly necessary
   - Document reason for force resets

3. **Cleanup regularly:**
   - Run `--cleanup-tasks` weekly
   - Keep failed task history under 1000 records
   - Archive old agent_logs monthly

4. **Error classification:**
   - Review `src/agents/triage.js` quarterly
   - Add new agent error patterns as needed
   - Ensure business logic errors counted correctly

5. **Testing:**
   - Use `--dry-run` before production resets
   - Test circuit breaker behavior in staging
   - Verify auto-recovery works as expected

## See Also

- [Agent System Documentation](agent-system.md)
- [Cron System Documentation](cron-system.md)
- [Circuit Breaker Implementation](../../src/utils/circuit-breaker.js)
- [Agent Runner](../../src/agents/runner.js)