E2E-TEST-DOCUMENTATION.md
# E2E Agent System Test Suite Documentation

## Overview

This test suite (`e2e-agent-system.test.js`) provides comprehensive end-to-end validation that the agent system is production-ready. It covers all critical workflows, error handling, circuit breakers, and inter-agent communication.

## Test Coverage

### 1. Task Lifecycle (2 tests)

**Purpose**: Verify tasks flow correctly through pending → running → completed states.

- **Test 1.1**: Complete task lifecycle
  - Creates classify_error task for Triage
  - Verifies task transitions through statuses correctly
  - Validates result_json contains classification and routing info

- **Test 1.2**: Task status transitions
  - Tracks status changes during processing
  - Verifies agent_logs contain start/complete events
  - Ensures timestamps are set correctly

**Why it matters**: Core task lifecycle is the foundation of the entire agent system. If tasks don't transition correctly, the whole system breaks down.

---

### 2. Inter-agent Communication (2 tests)

**Purpose**: Ensure agents can collaborate by creating tasks and sending messages.

- **Test 2.1**: Developer creates task → QA reviews
  - Developer fixes bug and creates QA verification task
  - Verifies handoff message sent from Developer to QA
  - Validates parent-child task relationship
  - **Mocks**: file operations, test runner, LLM calls, git commits

- **Test 2.2**: Agents ask questions and receive answers
  - Developer asks Triage for clarification
  - Triage sends answer with metadata linking to question
  - Verifies unread message retrieval works

**Why it matters**: Multi-agent workflows require reliable communication. Without working handoffs, tasks get stuck and manual intervention is required.

---

### 3. Error Handling (3 tests)

**Purpose**: Verify graceful degradation when things go wrong.
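
One form of graceful degradation checked in this section is bounded retry (Test 3.3). The sketch below is a hypothetical illustration of that idea, not the agent system's actual code: the `runWithRetries` name, its return shape, and the reading of "3 attempts" as three total tries are all assumptions.

```javascript
// Hypothetical sketch of bounded retries; names and return shape are assumptions.
async function runWithRetries(handler, maxAttempts = 3) {
  let lastError = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const result = await handler(attempt);
      return { status: 'completed', attempts: attempt, result };
    } catch (err) {
      lastError = err; // remember the most recent failure for the final message
    }
  }
  // Every attempt failed: report a terminal state instead of looping forever.
  return {
    status: 'failed',
    attempts: maxAttempts,
    error: `Failed after ${maxAttempts} attempts: ${lastError.message}`,
  };
}

// A handler that always fails, similar to the missing-file case.
runWithRetries(async () => {
  throw new Error('ENOENT: no such file');
}).then((outcome) => console.log(outcome));
```

The loop has a fixed upper bound, so a permanently failing task ends in a terminal `failed` state with a message that names the last error.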

- **Test 3.1**: Invalid context → graceful failure
  - Task missing required field (error_message)
  - Should fail with descriptive error message
  - Error logged to agent_logs table

- **Test 3.2**: Malformed JSON → graceful handling
  - Task with invalid JSON in context_json
  - Agent should handle gracefully (context_json is optional)

- **Test 3.3**: Retry logic (3 attempts → failed)
  - Task fails 3 times (missing file)
  - After 3 retries, marked as 'failed' with clear message
  - Retry count tracked correctly

**Why it matters**: Production systems must handle errors gracefully. Cryptic failures or infinite loops are unacceptable.

---

### 4. Circuit Breaker (2 tests)

**Purpose**: Prevent cascade failures when external services fail.

- **Test 4.1**: Multiple failures → circuit opens
  - Simulates 5 consecutive API failures
  - Circuit breaker state transitions to 'open'
  - Failure count tracked correctly

- **Test 4.2**: Circuit auto-recovers after cooldown
  - Circuit opened 35 minutes ago
  - After 30-minute cooldown, transitions to 'half_open'
  - Successful request closes circuit and resets failure count

**Why it matters**: Prevents wasting credits on failing APIs and allows graceful degradation instead of cascade failures.

---

### 5. Task Routing (3 tests)

**Purpose**: Verify Triage routes errors to correct agents.

- **Test 5.1**: Security error → Security agent
  - Unauthorized/security-related error
  - Routed to Security agent with priority 10 (critical)

- **Test 5.2**: Database constraint → Developer agent
  - UNIQUE constraint violation
  - Routed to Developer with suggested fix

- **Test 5.3**: Network error → Architect agent
  - ETIMEDOUT infrastructure issue
  - Routed to Architect (not Developer)

**Why it matters**: Proper routing ensures tasks reach the right expert, avoiding wasted effort and incorrect fixes.
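
The routing expectations in Tests 5.1 to 5.3 can be pictured as a small pattern table. This is a hypothetical sketch, not the TriageAgent implementation: the `routeError` function, the regular expressions, the fallback route, and every priority other than the documented priority 10 for security errors are assumptions made for illustration.

```javascript
// Hypothetical routing table. Only "security → priority 10" comes from the
// test descriptions; the patterns and other priorities are assumed.
const ROUTES = [
  { pattern: /unauthorized|security/i, agent: 'security', priority: 10 },
  { pattern: /UNIQUE constraint/i, agent: 'developer', priority: 5 },
  { pattern: /ETIMEDOUT/i, agent: 'architect', priority: 5 },
];

function routeError(errorMessage) {
  // First matching rule wins, so more critical rules are listed first.
  const match = ROUTES.find((route) => route.pattern.test(errorMessage));
  // Assumed default: unmatched errors go to Developer at low priority.
  return match
    ? { agent: match.agent, priority: match.priority }
    : { agent: 'developer', priority: 3 };
}

console.log(routeError('401 Unauthorized: invalid token'));
console.log(routeError('SQLITE_CONSTRAINT: UNIQUE constraint failed: users.email'));
console.log(routeError('connect ETIMEDOUT 10.0.0.1:443'));
```

Keeping the rules in an ordered table makes each routing test a single lookup assertion, and listing the security rule first ensures it wins when a message matches more than one pattern.
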

---

### 6. Priority Handling (2 tests)

**Purpose**: Ensure high-priority tasks get processed first.

- **Test 6.1**: High priority tasks first
  - Creates tasks with priorities 3, 9, 5
  - Processes one at a time
  - Verifies priority 9 runs first, then 5, then 3

- **Test 6.2**: Priority calculation
  - Security error in early stage (scoring)
  - Should calculate priority >= 8

**Why it matters**: Critical errors (security, data loss) must be fixed before low-priority bugs. Priority scheduling prevents starvation of important work.

---

### 7. Row-level Locking (1 test)

**Purpose**: Allow horizontal scaling without race conditions.

- **Test 7.1**: Concurrent agents don't claim same task
  - Enables row-level locking and horizontal scaling
  - Creates 2 agent instances
  - Both try to claim same task simultaneously
  - Only one succeeds (count1 + count2 == 1)
  - Task completed exactly once

**Why it matters**: Horizontal scaling allows processing more tasks in parallel. Without row-level locking, agents would duplicate work or corrupt state.

---

### 8. Known Error Database (1 test)

**Purpose**: Reuse fixes from similar past errors.

- **Test 8.1**: Similar errors get suggested fixes
  - Creates completed fix_bug task (known fix)
  - Creates new error with same pattern (different line number)
  - Triage detects known fix (similarity >= 70%)
  - Routed task includes suggested fix

**Why it matters**: Reduces LLM calls and fixes errors faster by learning from past successes. 70%+ similarity threshold avoids false matches.

---

### 9. Coverage Gates (1 test)

**Purpose**: Enforce 85% coverage before commits (hard gate).
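
A hard gate of this kind can be sketched as a threshold check on a coverage summary. The example below is an assumption-laden illustration rather than the Developer agent's code: `checkCoverageGate` is a hypothetical name, the gate is assumed to use line coverage, and the input is assumed to follow the `total.lines.pct` shape of Istanbul's `coverage-summary.json`.

```javascript
// Minimal sketch of an 85% coverage gate. The function name and the choice of
// line coverage are assumptions; the input mirrors Istanbul's json-summary shape.
const COVERAGE_THRESHOLD = 85;

function checkCoverageGate(summary, threshold = COVERAGE_THRESHOLD) {
  const pct = summary.total.lines.pct;
  if (pct >= threshold) {
    return { allowed: true, pct };
  }
  // Below the threshold the commit is refused and the reason is reported.
  return {
    allowed: false,
    pct,
    reason: `Line coverage ${pct}% is below the ${threshold}% gate`,
  };
}

// Mocked summaries like the ones a test-runner mock might return.
console.log(checkCoverageGate({ total: { lines: { pct: 70 } } })); // blocked
console.log(checkCoverageGate({ total: { lines: { pct: 90 } } })); // allowed
```

Returning a reason alongside the boolean gives a blocked task something concrete to record, which is what a test would assert on when the mocked runner reports 70% coverage.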

- **Test 9.1**: Developer enforces 85% coverage
  - Developer attempts to commit with 70% coverage
  - Commit blocked, task marked 'blocked'
  - QA task created to write tests
  - **Mocks**: file ops, test runner (returns 70% coverage), LLM, git

**Why it matters**: Coverage gates prevent technical debt accumulation. 85% is the threshold for production-quality code.

---

### 10. Workflow Dependencies (2 tests)

**Purpose**: Enforce proper workflows (design → approval → implementation).

- **Test 10.1**: Features require approved design
  - implement_feature task without approved design
  - Should block and auto-create design_proposal task

- **Test 10.2**: Approved design enables implementation
  - Creates approved design_proposal task
  - Feature implementation succeeds (not blocked)
  - **Mocks**: file ops, tests (90% coverage), LLM, git

**Why it matters**: Follows TOGAF/SRE best practices. Design reviews prevent architecture erosion and catch issues before implementation.

---

### 11. Full Multi-Agent Workflow (1 test)

**Purpose**: Integration test covering Monitor → Triage → Developer → QA.

- **Test 11.1**: Full bug fix workflow
  - Monitor detects error (simulated)
  - Triage classifies and routes
  - Developer fixes bug (mocked)
  - QA task created for verification
  - Verifies complete workflow chain

**Why it matters**: End-to-end validation that all agents work together correctly. This is the closest test to real-world production usage.

---

## Mocking Strategy

### Why Mock?

- **Speed**: Tests run in seconds instead of minutes
- **Reliability**: No external API dependencies
- **Cost**: Zero LLM credits spent on tests
- **Determinism**: Tests don't fail due to external service issues

### What's Mocked?

1. **File Operations** (`file-operations.js`)
   - `readFile()` - Returns mock file content
   - `editFile()` - Returns mock backup path and diff
   - `writeFile()` - Returns mock backup path
   - `getFileContext()` - Returns mock imports/test files

2. **Test Runner** (`test-runner.js`)
   - `runTestsForFile()` - Returns mock test results
   - `runTests()` - Returns mock coverage data
   - Allows simulating pass/fail scenarios and coverage percentages

3. **LLM API** (`agent-claude-api.js`)
   - `simpleLLMCall()` - Returns mock JSON fixes/implementations
   - Avoids hitting OpenRouter API during tests
   - Returns deterministic responses for assertions

4. **Git Commands** (`child_process.execSync`)
   - Returns mock commit hashes
   - Avoids actual git commits during tests

5. **Filesystem** (`fs.readFile` for coverage data)
   - Returns mock coverage-summary.json
   - Allows testing coverage gate logic

### What's NOT Mocked?

- Database operations (uses real SQLite in-memory DB)
- Agent logic (BaseAgent, TriageAgent, DeveloperAgent, QAAgent)
- Task manager (task creation, status transitions)
- Message manager (inter-agent messaging)

---

## Running the Tests

### Full Suite

```bash
npm test tests/agents/e2e-agent-system.test.js
```

### Individual Test Group

```bash
npm test tests/agents/e2e-agent-system.test.js -- --grep "Task Lifecycle"
npm test tests/agents/e2e-agent-system.test.js -- --grep "Inter-agent Communication"
npm test tests/agents/e2e-agent-system.test.js -- --grep "Circuit Breaker"
```

### Debug Mode

```bash
DEBUG=1 npm test tests/agents/e2e-agent-system.test.js
```

---

## Expected Results

### Success Criteria

- All 20 tests pass
- No unhandled promise rejections
- No database connection leaks
- Test database cleaned up after each test

### Coverage Goals

- Agent system core: 80%+ line coverage
- Task lifecycle: 90%+ coverage
- Error handling: 85%+ coverage

---

## Troubleshooting

### Test Failures

**"AssertionError: task not completed"**

- Check if agent initialized correctly (`await agent.initialize()`)
- Verify task has required context fields
- Check agent_logs table for error details

**"Database locked"**

- Ensure afterEach() closes database connections
- Call `resetBaseDb()`, `resetTaskDb()`, `resetMessageDb()`
- Check for concurrent test execution

**"Mock not called"**

- Verify mock.method() is called before agent processes task
- Check if import path matches exactly
- Ensure mock is restored in afterEach()

### Performance Issues

**Tests timeout**

- Increase timeout: `test('name', { timeout: 60000 }, async () => {...})`
- Check for infinite loops in agent logic
- Verify immediate invocation is disabled (`AGENT_IMMEDIATE_INVOCATION=false`)

**Database pollution**

- Ensure each test uses fresh database
- Verify TEST_DB_PATH is unique per test file
- Check afterEach() cleanup logic

---

## Future Enhancements

### Additional Test Scenarios

1. **Agent Deadlock Detection** - Circular dependencies between agents
2. **Task Timeout Handling** - Tasks stuck in 'running' state for >1 hour
3. **Resource Limits** - Agent system behavior under token budget constraints
4. **Concurrent Workflow Execution** - Multiple bug fixes in parallel
5. **Partial Rollback** - Undo changes when QA rejects fix

### Integration Tests (Real APIs)

- Mark with `.integration.test.js` suffix
- Run separately: `npm run test:integration`
- Use real LLM calls (small budget)
- Validate actual code generation quality

### Load Tests

- Simulate 100+ concurrent tasks
- Verify row-level locking under load
- Measure task throughput (tasks/minute)
- Test database performance at scale

---

## Contributing

### Adding New Tests

1. Follow existing structure (describe → test)
2. Use descriptive test names (what + why)
3. Add mocks for expensive operations
4. Document test purpose in comments
5. Verify cleanup in afterEach()

### Test Organization

- Group related tests in describe() blocks
- Order by complexity (simple → complex)
- Keep test cases independent (no shared state)
- Use setup/teardown for common initialization

### Assertions

- Use specific assertions (`strictEqual` vs `ok`)
- Provide descriptive failure messages
- Test both positive and negative cases
- Verify side effects (logs, messages, child tasks)

---

## References

- Agent Architecture: `/docs/06-automation/agent-system.md`
- Task Manager: `/src/agents/utils/task-manager.js`
- Message Manager: `/src/agents/utils/message-manager.js`
- Base Agent: `/src/agents/base-agent.js`