# Agent System Activation Checklist

**Status:** Ready for Activation ✅
**Date:** 2026-02-16
**Prepared by:** Claude Code

## Summary

The agent system's circuit breakers have been analyzed, and a reset script and supporting documentation have been created to prepare for activation. This document is the checklist for activating the agent system safely.

## Current State

### Blocked Agents

Three agents currently have circuit breakers triggered:

1. **Developer Agent**
   - Circuit breaker: 2026-02-15 09:22:15 (~24 hours old)
   - Failure rate: 35.7%
   - Should auto-recover: Only once the failure rate drops below 30% (cooldown has expired)

2. **Architect Agent**
   - Circuit breaker: 2026-02-16 01:59:47 (~15 hours old)
   - Failure rate: 44.0%
   - Should auto-recover: Only once the failure rate drops below 30% (cooldown has expired)

3. **Monitor Agent**
   - Circuit breaker: 2026-02-15 09:22:15 (~24 hours old)
   - Failure rate: 35.3%
   - Should auto-recover: Only once the failure rate drops below 30% (cooldown has expired)

### Other Agents (Active)

- **Triage:** 65.2% success rate (healthy)
- **QA:** 28.6% success rate, 71.4% failure rate (⚠️ high failures, but not blocked)
- **Security:** 0% success rate, 50% failure rate (low volume, 6 tasks)

## Tools Created

### 1. Circuit Breaker Reset Script

**File:** `scripts/reset-agent-circuit-breakers.js`

**Features:**

- ✅ Dry-run mode for safety
- ✅ Auto-detects circuit breakers older than 30 minutes
- ✅ Force reset option
- ✅ Task cleanup (mark old failed tasks as cancelled)
- ✅ Comprehensive status reporting

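The selection rules listed above (only blocked agents, a 30-minute age cutoff, and a force override) can be sketched as follows. This is a minimal illustration, not the contents of `scripts/reset-agent-circuit-breakers.js`: the function name `selectBreakersToReset`, the agent record shape (`name`, `status`, `trippedAt`), and the `"blocked"` status string are all assumptions made for the example.

```javascript
// Hypothetical sketch: none of these names come from the actual reset script.
const COOLDOWN_MINUTES = 30;

/**
 * Return the names of blocked agents whose circuit breakers are eligible for reset.
 * `agents` is an array of { name, status, trippedAt }, where trippedAt is an ISO timestamp.
 * With `force`, every blocked agent is selected regardless of breaker age.
 */
function selectBreakersToReset(agents, { force = false, now = new Date() } = {}) {
  const cutoffMs = COOLDOWN_MINUTES * 60 * 1000;
  return agents
    .filter((agent) => agent.status === "blocked")
    .filter((agent) => force || now - new Date(agent.trippedAt) >= cutoffMs)
    .map((agent) => agent.name);
}
```

In a dry run, a script along these lines would only print the returned names; an actual reset would then clear the breaker for each of them.
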
**NPM Commands:**

```bash
npm run agent:reset-breakers:dry-run  # Preview
npm run agent:reset-breakers          # Reset (30+ min old)
npm run agent:reset-breakers:force    # Force reset all + cleanup
```

### 2. Documentation

Created comprehensive documentation:

- ✅ `docs/06-automation/circuit-breaker-management.md` - Full circuit breaker guide
- ✅ `CIRCUIT_BREAKER_RESET_SUMMARY.md` - Analysis and current state
- ✅ `AGENT_ACTIVATION_READY.md` - This activation checklist
- ✅ Updated `docs/06-automation/README.md`
- ✅ Updated `README.md` with agent commands

## Activation Procedure

### Step 1: Reset Circuit Breakers

```bash
# Preview what will be reset
npm run agent:reset-breakers:dry-run

# Reset circuit breakers and clean up old tasks
npm run agent:reset-breakers:force
```

**Expected outcome:**

- Developer, Architect, and Monitor circuit breakers are reset
- Old failed tasks (>24 hours) are marked as cancelled
- All agents show status "✅ ACTIVE"

### Step 2: Verify Agent Health

```bash
# Check agent status
npm run agent:list

# Check pending tasks
npm run agent:tasks

# Check recent failures
npm run agent:stats
```

**Success criteria:**

- All agents show status "✅ ACTIVE" (not "🔴 BLOCKED")
- No agent reports a circuit breaker timestamp
- The task queue shows no unexpected backlog of pending tasks

### Step 3: Investigate High Failure Rates

**QA Agent (71.4% failure rate):**

```bash
# Review recent QA failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'qa' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Developer Agent (35.7% failure rate):**

```bash
# Review recent Developer failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'developer' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Architect Agent (44.0% failure rate):**

```bash
# Review recent Architect failures
sqlite3 db/sites.db "SELECT error_message, COUNT(*) FROM agent_tasks WHERE status = 'failed' AND assigned_to = 'architect' AND created_at > datetime('now', '-24 hours') GROUP BY error_message;"
```

**Note:** If failures are primarily "agent system errors" (unknown task types, validation), these should NOT trigger circuit breakers and may indicate routing issues.

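To make the note above concrete, here is one way a failure rate could exclude agent system errors so that they do not count toward the circuit breaker. It is a sketch under assumptions: the error patterns, the `completed` status value, and the function names are hypothetical and are not taken from the project's code.

```javascript
// Hypothetical patterns for "agent system errors"; the real classification may differ.
const SYSTEM_ERROR_PATTERNS = [/unknown task type/i, /validation/i];

/** True when an error message looks like a routing or validation problem. */
function isAgentSystemError(errorMessage) {
  return SYSTEM_ERROR_PATTERNS.some((pattern) => pattern.test(errorMessage || ""));
}

/**
 * Failure rate that ignores failures caused by agent system errors.
 * `tasks` is an array of { status, error_message }; only "completed" and
 * "failed" tasks are considered (the "completed" value is an assumption).
 */
function breakerFailureRate(tasks) {
  const counted = tasks.filter((task) => {
    if (task.status === "completed") return true;
    return task.status === "failed" && !isAgentSystemError(task.error_message);
  });
  if (counted.length === 0) return 0;
  const failures = counted.filter((task) => task.status === "failed").length;
  return failures / counted.length;
}
```
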
### Step 4: Enable Agent System

**Option A: Environment Variable**

```bash
# Set the flag in .env (replace an existing entry instead of appending a duplicate)
grep -q '^AGENT_SYSTEM_ENABLED=' .env 2>/dev/null \
  && sed -i 's/^AGENT_SYSTEM_ENABLED=.*/AGENT_SYSTEM_ENABLED=true/' .env \
  || echo "AGENT_SYSTEM_ENABLED=true" >> .env
```

**Option B: Systemd Service (NixOS)**

```bash
# Restart agent service
sudo systemctl restart 333method-agent

# Check service status
sudo systemctl status 333method-agent

# View logs
sudo journalctl -u 333method-agent -f
```

**Option C: Cron (Manual)**

```bash
# Enable agent cron job
npm run cron:enable agent-health-check

# Verify enabled
npm run cron:list
```

### Step 5: Monitor Closely (First 24 Hours)

**Hourly checks:**

```bash
# Check agent health
npm run agent:list

# Check new failures
npm run agent:stats
```

**Daily checks:**

```bash
# Review failure patterns
sqlite3 db/sites.db "SELECT assigned_to, COUNT(*) as failures FROM agent_tasks WHERE status = 'failed' AND created_at > datetime('now', '-24 hours') GROUP BY assigned_to;"

# Check circuit breaker logs
sqlite3 db/sites.db "SELECT * FROM agent_logs WHERE message LIKE '%circuit breaker%' ORDER BY created_at DESC LIMIT 10;"
```

## Circuit Breaker Auto-Recovery

The system has built-in auto-recovery, so no manual intervention is needed once both conditions below hold:

**Auto-Recovery Conditions:**

1. Cooldown period expired (30 minutes)
2. Failure rate dropped below threshold (30%)

**How it works:**

- On every health check (every 5 minutes), the system checks blocked agents
- If both conditions are met, the agent automatically transitions to "active"
- The recovery is logged to the `agent_logs` table
- Metrics are updated with the recovery timestamp

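The two auto-recovery conditions can be expressed as a single predicate. The following is a minimal sketch, assuming a hypothetical agent record with `status`, `trippedAt`, and `failureRate` (a fraction between 0 and 1); the function name and record shape are illustrative and not taken from the health-check code.

```javascript
/**
 * Decide whether a blocked agent may transition back to "active".
 * Both conditions must hold: the cooldown has expired AND the failure
 * rate is strictly below the threshold. All names here are hypothetical.
 */
function shouldAutoRecover(
  agent,
  { now = new Date(), cooldownMinutes = 30, threshold = 0.3 } = {}
) {
  if (agent.status !== "blocked") return false;
  const elapsedMinutes = (now - new Date(agent.trippedAt)) / 60000;
  const cooldownExpired = elapsedMinutes >= cooldownMinutes;
  const failureRateOk = agent.failureRate < threshold;
  return cooldownExpired && failureRateOk;
}
```

Under this rule, an agent with an expired cooldown but a failure rate still at or above the threshold stays blocked until the rate falls.
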
**Manual override:**

- Use `npm run agent:reset-breakers:force` to bypass cooldown
- Should only be needed if auto-recovery fails

## Troubleshooting

### Circuit Breaker Won't Reset

**Symptoms:** Agent stays blocked after running reset script

**Solutions:**

1. Check if failure rate is still high (>30%):
   ```bash
   npm run agent:stats
   ```
2. Force reset:
   ```bash
   npm run agent:reset-breakers:force
   ```
3. Check for systemic issues causing failures

### High Failure Rate Persists

**Symptoms:** The circuit breaker resets but trips again immediately

**Investigation:**

1. Review error patterns:
   ```sql
   SELECT error_message, COUNT(*) as count
   FROM agent_tasks
   WHERE status = 'failed' AND assigned_to = 'developer'
   GROUP BY error_message
   ORDER BY count DESC;
   ```
2. Check if errors are agent system errors (should NOT trigger circuit breaker)
3. Verify error classification in `src/agents/triage.js`
4. Fix underlying issues before enabling the system

### Agent System Disabled

**Symptoms:** Agents not processing tasks

**Check:**

1. Environment variable: `grep AGENT_SYSTEM_ENABLED .env` (or `echo $AGENT_SYSTEM_ENABLED` if it is exported in the current shell)
2. Cron job status: `npm run cron:list`
3. Systemd service: `systemctl status 333method-agent`

## Rollback Plan

If activation causes issues:

### Immediate Rollback

```bash
# Disable agent system (replace the existing entry instead of appending a duplicate)
grep -q '^AGENT_SYSTEM_ENABLED=' .env 2>/dev/null \
  && sed -i 's/^AGENT_SYSTEM_ENABLED=.*/AGENT_SYSTEM_ENABLED=false/' .env \
  || echo "AGENT_SYSTEM_ENABLED=false" >> .env

# Stop systemd service (if using)
sudo systemctl stop 333method-agent

# Disable cron (if using)
npm run cron:disable agent-health-check
```

### Investigation

```bash
# Review recent agent activity
npm run agent:logs

# Check failure patterns
npm run agent:stats

# Examine specific failed tasks
sqlite3 db/sites.db "SELECT * FROM agent_tasks WHERE status = 'failed' ORDER BY created_at DESC LIMIT 20;"
```

### Re-enable (After Fixes)

```bash
# Fix underlying issues
# ...

# Reset circuit breakers
npm run agent:reset-breakers:force

# Verify health
npm run agent:list

# Re-enable (replace the existing entry instead of appending a duplicate)
grep -q '^AGENT_SYSTEM_ENABLED=' .env 2>/dev/null \
  && sed -i 's/^AGENT_SYSTEM_ENABLED=.*/AGENT_SYSTEM_ENABLED=true/' .env \
  || echo "AGENT_SYSTEM_ENABLED=true" >> .env
```

## Success Metrics

**After 24 hours, check:**

1. **Failure rates trending down:**
   - Developer: <30%
   - QA: <30%
   - Architect: <30%

2. **No circuit breaker triggers:**

   ```bash
   sqlite3 db/sites.db "SELECT * FROM agent_logs WHERE message LIKE '%circuit breaker%' AND created_at > datetime('now', '-24 hours');"
   ```

3. **Tasks completing successfully:**

   ```bash
   npm run agent:stats
   ```

4. **System stability:**
   - No runaway API costs
   - No cascading failures
   - Agents processing tasks as expected

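The first check above (every failure rate below 30%) is easy to automate once the rates are available as numbers. The helper below is a hypothetical sketch: the input shape (agent name mapped to a failure rate as a fraction) is an assumption, and `npm run agent:stats` may report rates in a different format.

```javascript
/**
 * Return the names of agents whose failure rate is at or above the target,
 * sorted alphabetically. `failureRates` maps agent name to a fraction (0..1).
 * The function name and input shape are illustrative assumptions.
 */
function agentsOverThreshold(failureRates, threshold = 0.3) {
  return Object.entries(failureRates)
    .filter(([, rate]) => rate >= threshold)
    .map(([name]) => name)
    .sort();
}
```

An empty result means the failure-rate criterion is met; any names returned are the agents to investigate before declaring the activation successful.
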
## Configuration

### Environment Variables

```bash
# Circuit breaker thresholds
AGENT_CIRCUIT_BREAKER_THRESHOLD=0.3      # 30% failure rate (default)
AGENT_CIRCUIT_BREAKER_COOLDOWN=30        # 30 minutes cooldown (default)

# Agent system
AGENT_SYSTEM_ENABLED=true                # Enable agent system
AGENT_MAX_INVOCATIONS_PER_HOUR=60        # Rate limit (default: 60)

# Logging
LOG_LEVEL=info                           # Log level (debug|info|warn|error)
```

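As an illustration of how these variables and their documented defaults fit together, the sketch below parses them from an environment object and rejects obviously wrong values (for example, a threshold written as `30` instead of `0.3`). The function name `loadBreakerConfig` and the validation rules are assumptions, and treating an unset `AGENT_SYSTEM_ENABLED` as disabled is a guess rather than documented behavior.

```javascript
/**
 * Read circuit breaker settings from an environment object, applying the
 * defaults listed above. Hypothetical helper, not the project's actual loader.
 */
function loadBreakerConfig(env = process.env) {
  const threshold = Number.parseFloat(env.AGENT_CIRCUIT_BREAKER_THRESHOLD ?? "0.3");
  const cooldownMinutes = Number.parseInt(env.AGENT_CIRCUIT_BREAKER_COOLDOWN ?? "30", 10);
  const maxInvocationsPerHour = Number.parseInt(env.AGENT_MAX_INVOCATIONS_PER_HOUR ?? "60", 10);

  // The threshold is a fraction, so values such as 30 (meant as 30%) are rejected.
  if (!(threshold > 0 && threshold <= 1)) {
    throw new RangeError("AGENT_CIRCUIT_BREAKER_THRESHOLD must be in (0, 1]");
  }
  if (!(Number.isInteger(cooldownMinutes) && cooldownMinutes > 0)) {
    throw new RangeError("AGENT_CIRCUIT_BREAKER_COOLDOWN must be a positive integer");
  }
  if (!(Number.isInteger(maxInvocationsPerHour) && maxInvocationsPerHour > 0)) {
    throw new RangeError("AGENT_MAX_INVOCATIONS_PER_HOUR must be a positive integer");
  }

  return {
    enabled: env.AGENT_SYSTEM_ENABLED === "true", // assumption: unset means disabled
    threshold,
    cooldownMinutes,
    maxInvocationsPerHour,
  };
}
```
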
### Tuning Recommendations

**If circuit breakers trigger too often:**

- Increase threshold: `AGENT_CIRCUIT_BREAKER_THRESHOLD=0.4` (40%)
- Increase cooldown: `AGENT_CIRCUIT_BREAKER_COOLDOWN=60` (60 minutes)

**If circuit breakers don't trigger when needed:**

- Decrease threshold: `AGENT_CIRCUIT_BREAKER_THRESHOLD=0.2` (20%)
- Decrease cooldown: `AGENT_CIRCUIT_BREAKER_COOLDOWN=15` (15 minutes)

## Documentation

**Comprehensive guides:**

- [Agent System](docs/06-automation/agent-system.md) - Full agent architecture
- [Circuit Breaker Management](docs/06-automation/circuit-breaker-management.md) - Detailed circuit breaker guide
- [Cron System](docs/06-automation/cron-system.md) - Cron job setup and monitoring

**Quick references:**

- [README.md](README.md) - Agent commands and usage
- [CIRCUIT_BREAKER_RESET_SUMMARY.md](CIRCUIT_BREAKER_RESET_SUMMARY.md) - Current state analysis

## Support

**If issues arise:**

1. Review logs: `npm run agent:logs`
2. Check documentation above
3. Disable system if needed (rollback plan)
4. Investigate and fix root causes
5. Re-enable after verification

## Sign-Off

- [x] Circuit breakers analyzed
- [x] Reset tools created
- [x] Documentation written
- [x] Commands added to package.json
- [x] README.md updated
- [x] Activation procedure documented
- [x] Rollback plan prepared
- [x] Monitoring plan established

**Ready for activation!** 🚀

Run `npm run agent:reset-breakers:dry-run` to preview the reset, then `npm run agent:reset-breakers:force` to begin.