# Agent System Activation Checklist

**Status:** Ready for Activation ✅
**Date:** 2026-02-16
**Prepared by:** Claude Code

## Summary

The agent system circuit breakers have been analyzed and tools have been created to prepare for activation. This document provides a complete checklist for activating the agent system safely.

## Current State

### Blocked Agents

Three agents currently have circuit breakers triggered:

1. **Developer Agent**
   - Circuit breaker: 2026-02-15 09:22:15 (~24 hours old)
   - Failure rate: 35.7%
   - Should auto-recover: Yes (cooldown expired)

2. **Architect Agent**
   - Circuit breaker: 2026-02-16 01:59:47 (~15 hours old)
   - Failure rate: 44.0%
   - Should auto-recover: Yes (cooldown expired)

3. **Monitor Agent**
   - Circuit breaker: 2026-02-15 09:22:15 (~24 hours old)
   - Failure rate: 35.3%
   - Should auto-recover: Yes (cooldown expired)

### Other Agents (Active)

- **Triage:** 65.2% success rate (healthy)
- **QA:** 28.6% success rate, 71.4% failure rate (⚠️ high failures, but not blocked)
- **Security:** 0% success rate, 50% failure rate (low volume, 6 tasks)

## Tools Created

### 1. Circuit Breaker Reset Script

**File:** `scripts/reset-agent-circuit-breakers.js`

**Features:**

- ✅ Dry-run mode for safety
- ✅ Auto-detects circuit breakers older than 30 minutes
- ✅ Force reset option
- ✅ Task cleanup (mark old failed tasks as cancelled)
- ✅ Comprehensive status reporting

**NPM Commands:**

```bash
npm run agent:reset-breakers:dry-run  # Preview
npm run agent:reset-breakers          # Reset (30+ min old)
npm run agent:reset-breakers:force    # Force reset all + cleanup
```

### 2. Documentation

Created comprehensive documentation:

- ✅ `docs/06-automation/circuit-breaker-management.md` - Full circuit breaker guide
- ✅ `CIRCUIT_BREAKER_RESET_SUMMARY.md` - Analysis and current state
- ✅ `AGENT_ACTIVATION_READY.md` - This activation checklist
- ✅ Updated `docs/06-automation/README.md`
- ✅ Updated `README.md` with agent commands

## Activation Procedure

### Step 1: Reset Circuit Breakers

```bash
# Preview what will be reset
npm run agent:reset-breakers:dry-run

# Reset circuit breakers and clean up old tasks
npm run agent:reset-breakers:force
```

**Expected outcome:**

- Developer, Architect, Monitor circuit breakers reset
- Old failed tasks (>24 hours) marked as cancelled
- All agents should show status "✅ ACTIVE"

### Step 2: Verify Agent Health

```bash
# Check agent status
npm run agent:list

# Check pending tasks
npm run agent:tasks

# Check recent failures
npm run agent:stats
```

**Success criteria:**

- All agents show status "✅ ACTIVE" (not "🔴 BLOCKED")
- No agent shows a circuit breaker timestamp
- Task queue shows reasonable numbers

### Step 3: Investigate High Failure Rates

**QA Agent (71.4% failure rate):**

```bash
# Review recent QA failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'qa' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Developer Agent (35.7% failure rate):**

```bash
# Review recent Developer failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'developer' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Architect Agent (44.0% failure rate):**

```bash
# Review recent Architect failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'architect' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Note:** If failures are primarily "agent system errors" (unknown task types, validation), these should NOT trigger circuit breakers and may indicate routing issues.

### Step 4: Enable Agent System

**Option A: Environment Variable**

```bash
# Append the flag to the .env file
echo "AGENT_SYSTEM_ENABLED=true" >> .env
```

**Option B: Systemd Service (NixOS)**

```bash
# Restart agent service
sudo systemctl restart 333method-agent

# Check service status
sudo systemctl status 333method-agent

# View logs
sudo journalctl -u 333method-agent -f
```

**Option C: Cron (Manual)**

```bash
# Enable agent cron job
npm run cron:enable agent-health-check

# Verify enabled
npm run cron:list
```

### Step 5: Monitor Closely (First 24 Hours)

**Hourly checks:**

```bash
# Check agent health
npm run agent:list

# Check new failures
npm run agent:stats
```

**Daily checks:**

```bash
# Review failure patterns
sqlite3 db/sites.db "SELECT assigned_to, COUNT(*) as failures FROM agent_tasks WHERE status = 'failed' AND created_at > datetime('now', '-24 hours') GROUP BY assigned_to;"

# Check circuit breaker logs
sqlite3 db/sites.db "SELECT * FROM agent_logs WHERE message LIKE '%circuit breaker%' ORDER BY created_at DESC LIMIT 10;"
```

## Circuit Breaker Auto-Recovery

The system has built-in auto-recovery (no manual intervention needed):

**Auto-Recovery Conditions:**

1. Cooldown period expired (30 minutes)
2. Failure rate dropped below threshold (30%)

**How it works:**

- On every health check (every 5 minutes), the system checks blocked agents
- If both conditions are met, the agent automatically transitions to "active"
- Recovery is logged to the `agent_logs` table
- Metrics are updated with the recovery timestamp

**Manual override:**

- Use `npm run agent:reset-breakers:force` to bypass cooldown
- Should only be needed if auto-recovery fails

## Troubleshooting

### Circuit Breaker Won't Reset

**Symptoms:** Agent stays blocked after running reset script

**Solutions:**

1. Check if failure rate is still high (>30%):
   ```bash
   npm run agent:stats
   ```
2. Force reset:
   ```bash
   npm run agent:reset-breakers:force
   ```
3. Check for systemic issues causing failures

### High Failure Rate Persists

**Symptoms:** Agent resets but immediately triggers again

**Investigation:**

1. Review error patterns:
   ```sql
   SELECT error_message, COUNT(*) as count
   FROM agent_tasks
   WHERE status = 'failed' AND assigned_to = 'developer'
   GROUP BY error_message
   ORDER BY count DESC;
   ```
2. Check if errors are agent system errors (should NOT trigger circuit breaker)
3. Verify error classification in `src/agents/triage.js`
4. Fix underlying issues before enabling system

### Agent System Disabled

**Symptoms:** Agents not processing tasks

**Check:**

1. Environment variable: `echo $AGENT_SYSTEM_ENABLED`
2. Cron job status: `npm run cron:list`
3. Systemd service: `systemctl status 333method-agent`

## Rollback Plan

If activation causes issues:

### Immediate Rollback

```bash
# Disable agent system
echo "AGENT_SYSTEM_ENABLED=false" >> .env

# Stop systemd service (if using)
sudo systemctl stop 333method-agent

# Disable cron (if using)
npm run cron:disable agent-health-check
```

### Investigation

```bash
# Review recent agent activity
npm run agent:logs

# Check failure patterns
npm run agent:stats

# Examine specific failed tasks
sqlite3 db/sites.db "SELECT * FROM agent_tasks WHERE status = 'failed' ORDER BY created_at DESC LIMIT 20;"
```

### Re-enable (After Fixes)

```bash
# Fix underlying issues
# ...

# Reset circuit breakers
npm run agent:reset-breakers:force

# Verify health
npm run agent:list

# Re-enable
echo "AGENT_SYSTEM_ENABLED=true" >> .env
```

## Success Metrics

**After 24 hours, check:**

1. **Failure rates trending down:**
   - Developer: <30%
   - QA: <30%
   - Architect: <30%

2. **No circuit breaker triggers:**

   ```bash
   sqlite3 db/sites.db "SELECT * FROM agent_logs WHERE message LIKE '%circuit breaker%' AND created_at > datetime('now', '-24 hours');"
   ```

3. **Tasks completing successfully:**

   ```bash
   npm run agent:stats
   ```

4. **System stability:**
   - No runaway API costs
   - No cascading failures
   - Agents processing tasks as expected

## Configuration

### Environment Variables

```bash
# Circuit breaker thresholds
AGENT_CIRCUIT_BREAKER_THRESHOLD=0.3   # 30% failure rate (default)
AGENT_CIRCUIT_BREAKER_COOLDOWN=30     # 30 minutes cooldown (default)

# Agent system
AGENT_SYSTEM_ENABLED=true             # Enable agent system
AGENT_MAX_INVOCATIONS_PER_HOUR=60     # Rate limit (default: 60)

# Logging
LOG_LEVEL=info                        # Log level (debug|info|warn|error)
```

### Tuning Recommendations

**If circuit breakers trigger too often:**

- Increase threshold: `AGENT_CIRCUIT_BREAKER_THRESHOLD=0.4` (40%)
- Increase cooldown: `AGENT_CIRCUIT_BREAKER_COOLDOWN=60` (60 minutes)

**If circuit breakers don't trigger when needed:**

- Decrease threshold: `AGENT_CIRCUIT_BREAKER_THRESHOLD=0.2` (20%)
- Decrease cooldown: `AGENT_CIRCUIT_BREAKER_COOLDOWN=15` (15 minutes)

## Documentation

**Comprehensive guides:**

- [Agent System](docs/06-automation/agent-system.md) - Full agent architecture
- [Circuit Breaker Management](docs/06-automation/circuit-breaker-management.md) - Detailed circuit breaker guide
- [Cron System](docs/06-automation/cron-system.md) - Cron job setup and monitoring

**Quick references:**

- [README.md](README.md) - Agent commands and usage
- [CIRCUIT_BREAKER_RESET_SUMMARY.md](CIRCUIT_BREAKER_RESET_SUMMARY.md) - Current state analysis

## Support

**If issues arise:**

1. Review logs: `npm run agent:logs`
2. Check documentation above
3. Disable system if needed (rollback plan)
4. Investigate and fix root causes
5. Re-enable after verification

## Sign-Off

- [x] Circuit breakers analyzed
- [x] Reset tools created
- [x] Documentation written
- [x] Commands added to package.json
- [x] README.md updated
- [x] Activation procedure documented
- [x] Rollback plan prepared
- [x] Monitoring plan established

**Ready for activation!** 🚀

Run `npm run agent:reset-breakers:force` to begin.
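
For reference, the two auto-recovery conditions listed under "Circuit Breaker Auto-Recovery" (cooldown expired, failure rate below threshold) can be sketched as a small decision function. This is a minimal illustration, not the project's actual implementation: the function name `shouldAutoRecover`, the `agent` object shape, and the assumption that timestamps are UTC are all hypothetical. Only the 30% threshold and 30-minute cooldown come from this document.

```javascript
// Hypothetical sketch: names and data shapes are assumptions for illustration,
// not code taken from the agent system.

// Defaults mirror the documented values (30% threshold, 30-minute cooldown).
const DEFAULTS = { threshold: 0.3, cooldownMinutes: 30 };

/**
 * Evaluate both documented auto-recovery conditions for a blocked agent.
 * agent.trippedAt is the circuit breaker timestamp (Date),
 * agent.failureRate is a fraction between 0 and 1.
 */
function shouldAutoRecover(agent, now, config = DEFAULTS) {
  const elapsedMinutes = (now.getTime() - agent.trippedAt.getTime()) / 60000;
  const cooldownExpired = elapsedMinutes >= config.cooldownMinutes;
  // "Dropped below threshold" is read here as strictly less than.
  const failureRateOk = agent.failureRate < config.threshold;
  return { cooldownExpired, failureRateOk, recover: cooldownExpired && failureRateOk };
}

// Developer Agent figures from "Blocked Agents" (timestamp assumed to be UTC),
// evaluated 24 hours after the breaker tripped.
const developer = { trippedAt: new Date('2026-02-15T09:22:15Z'), failureRate: 0.357 };
const result = shouldAutoRecover(developer, new Date('2026-02-16T09:22:15Z'));
console.log(result);
// → { cooldownExpired: true, failureRateOk: false, recover: false }
```

Under these assumptions, an expired cooldown alone is not sufficient: an agent whose failure rate is still above 30% would stay blocked until the rate drops or `npm run agent:reset-breakers:force` is run.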