# E2E Agent System Test Suite Documentation

## Overview

This test suite (`e2e-agent-system.test.js`) provides comprehensive end-to-end validation that the agent system is production-ready. It covers all critical workflows, error handling, circuit breakers, and inter-agent communication.

## Test Coverage

### 1. Task Lifecycle (2 tests)

**Purpose**: Verify tasks flow correctly through pending → running → completed states.

- **Test 1.1**: Complete task lifecycle
  - Creates classify_error task for Triage
  - Verifies task transitions through statuses correctly
  - Validates result_json contains classification and routing info

- **Test 1.2**: Task status transitions
  - Tracks status changes during processing
  - Verifies agent_logs contain start/complete events
  - Ensures timestamps are set correctly

**Why it matters**: Core task lifecycle is the foundation of the entire agent system. If tasks don't transition correctly, the whole system breaks down.
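The lifecycle above can be sketched as a tiny state machine. The helper and field names below (`makeTask`, `start`, `complete`, `result_json`) are illustrative only, not the suite's real task-manager API:

```javascript
// Illustrative lifecycle sketch: pending -> running -> completed.
function makeTask(type) {
  return { type, status: 'pending', started_at: null, completed_at: null, result_json: null };
}

function start(task) {
  if (task.status !== 'pending') throw new Error(`cannot start from '${task.status}'`);
  task.status = 'running';
  task.started_at = Date.now();
  return task;
}

function complete(task, result) {
  if (task.status !== 'running') throw new Error(`cannot complete from '${task.status}'`);
  task.status = 'completed';
  task.completed_at = Date.now();
  task.result_json = JSON.stringify(result); // what Test 1.1 inspects
  return task;
}

const task = makeTask('classify_error');
complete(start(task), { classification: 'database', route_to: 'developer' });
console.log(task.status); // 'completed'
```

The guard clauses illustrate why out-of-order transitions surface as explicit errors rather than silent state corruption.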

---

### 2. Inter-agent Communication (2 tests)

**Purpose**: Ensure agents can collaborate by creating tasks and sending messages.

- **Test 2.1**: Developer creates task → QA reviews
  - Developer fixes bug and creates QA verification task
  - Verifies handoff message sent from Developer to QA
  - Validates parent-child task relationship
  - **Mocks**: file operations, test runner, LLM calls, git commits

- **Test 2.2**: Agents ask questions and receive answers
  - Developer asks Triage for clarification
  - Triage sends answer with metadata linking to question
  - Verifies unread message retrieval works

**Why it matters**: Multi-agent workflows require reliable communication. Without working handoffs, tasks get stuck and manual intervention is required.
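The handoff in Test 2.1 pairs a child task with a message. The field names below (`parent_task_id`, `assigned_to`, `read`) are hypothetical stand-ins for the real schema; the sketch only shows the shape of the pattern:

```javascript
// Sketch of a Developer -> QA handoff: a child task plus an unread message.
const tasks = [];
const messages = [];

function handoff(parentTask, summary) {
  const qaTask = {
    id: tasks.length + 1,
    type: 'verify_fix',
    parent_task_id: parentTask.id, // parent-child link checked by Test 2.1
    assigned_to: 'qa',
    status: 'pending',
  };
  tasks.push(qaTask);
  messages.push({ from: 'developer', to: 'qa', body: summary, read: false });
  return qaTask;
}

const qa = handoff({ id: 7 }, 'Fixed null check in scoring module, please verify');
console.log(qa.parent_task_id, messages[0].to); // 7 'qa'
```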

---

### 3. Error Handling (3 tests)

**Purpose**: Verify graceful degradation when things go wrong.

- **Test 3.1**: Invalid context → graceful failure
  - Task missing required field (error_message)
  - Should fail with descriptive error message
  - Error logged to agent_logs table

- **Test 3.2**: Malformed JSON → graceful handling
  - Task with invalid JSON in context_json
  - Agent should handle gracefully (context_json is optional)

- **Test 3.3**: Retry logic (3 attempts → failed)
  - Task fails 3 times (missing file)
  - After the third failed attempt, marked as 'failed' with a clear message
  - Retry count tracked correctly

**Why it matters**: Production systems must handle errors gracefully. Cryptic failures or infinite loops are unacceptable.
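The retry policy in Test 3.3 can be sketched as a bounded loop. The function and field names are illustrative, not the agent's actual internals:

```javascript
// Sketch of bounded retries: up to 3 attempts, then the task is 'failed'.
const MAX_RETRIES = 3;

function processWithRetries(task, attempt) {
  for (let i = 0; i < MAX_RETRIES; i++) {
    task.retry_count = i + 1; // tracked so the test can assert on it
    try {
      return { status: 'completed', result: attempt() };
    } catch (err) {
      task.last_error = err.message;
    }
  }
  task.status = 'failed'; // no fourth attempt: fail with a clear message
  return { status: 'failed', reason: `gave up after ${MAX_RETRIES} attempts: ${task.last_error}` };
}

const task = {};
const outcome = processWithRetries(task, () => { throw new Error('ENOENT: missing file'); });
console.log(outcome.status, task.retry_count); // 'failed' 3
```

The hard cap is what rules out the infinite loops called out above.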

---

### 4. Circuit Breaker (2 tests)

**Purpose**: Prevent cascade failures when external services fail.

- **Test 4.1**: Multiple failures → circuit opens
  - Simulates 5 consecutive API failures
  - Circuit breaker state transitions to 'open'
  - Failure count tracked correctly

- **Test 4.2**: Circuit auto-recovers after cooldown
  - Circuit opened 35 minutes ago
  - After 30-minute cooldown, transitions to 'half_open'
  - Successful request closes circuit and resets failure count

**Why it matters**: Prevents wasting credits on failing APIs and allows graceful degradation instead of cascade failures.
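Tests 4.1 and 4.2 exercise a closed → open → half_open → closed cycle. A minimal sketch, using the 5-failure threshold and 30-minute cooldown from the tests above (the class itself is illustrative, not the production implementation):

```javascript
// Illustrative circuit-breaker state machine for Tests 4.1/4.2.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30 * 60 * 1000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;            // injectable clock makes the cooldown testable
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
  recordSuccess() {
    this.state = 'closed';     // a successful probe closes the circuit...
    this.failures = 0;         // ...and resets the failure count (Test 4.2)
  }
  canRequest() {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // cooldown elapsed: allow one probe request
    }
    return this.state !== 'open';
  }
}

const breaker = new CircuitBreaker();
for (let i = 0; i < 5; i++) breaker.recordFailure();
console.log(breaker.state); // 'open'
```

Injecting the clock is what lets a test simulate "opened 35 minutes ago" without actually waiting.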

---

### 5. Task Routing (3 tests)

**Purpose**: Verify Triage routes errors to correct agents.

- **Test 5.1**: Security error → Security agent
  - Unauthorized/security-related error
  - Routed to Security agent with priority 10 (critical)

- **Test 5.2**: Database constraint → Developer agent
  - UNIQUE constraint violation
  - Routed to Developer with suggested fix

- **Test 5.3**: Network error → Architect agent
  - ETIMEDOUT infrastructure issue
  - Routed to Architect (not Developer)

**Why it matters**: Proper routing ensures tasks reach the right expert, avoiding wasted effort and incorrect fixes.
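The routing outcomes asserted in Tests 5.1–5.3 can be expressed as simple rules. The real Triage agent classifies via LLM, so the regexes below are only a hypothetical approximation of the expected decisions:

```javascript
// Rule-based sketch of the routing decisions checked by Tests 5.1-5.3.
function routeError(message) {
  if (/unauthorized|security|csrf|injection/i.test(message)) {
    return { agent: 'security', priority: 10 }; // critical (Test 5.1)
  }
  if (/constraint|unique|sqlite_/i.test(message)) {
    return { agent: 'developer', priority: 6 }; // code-level fix (Test 5.2)
  }
  if (/ETIMEDOUT|ECONNREFUSED|ENOTFOUND/.test(message)) {
    return { agent: 'architect', priority: 7 }; // infrastructure (Test 5.3)
  }
  return { agent: 'triage', priority: 5 };      // unknown: keep for triage
}

console.log(routeError('UNIQUE constraint failed: users.email').agent); // 'developer'
```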

---

### 6. Priority Handling (2 tests)

**Purpose**: Ensure high-priority tasks get processed first.

- **Test 6.1**: High priority tasks first
  - Creates tasks with priorities 3, 9, 5
  - Processes one at a time
  - Verifies priority 9 runs first, then 5, then 3

- **Test 6.2**: Priority calculation
  - Security error in early stage (scoring)
  - Should calculate priority >= 8

**Why it matters**: Critical errors (security, data loss) must be fixed before low-priority bugs. Priority scheduling prevents starvation of important work.
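The ordering checked by Test 6.1 amounts to claiming the highest-priority pending task first. A minimal sketch, assuming descending-priority selection:

```javascript
// Sketch of priority-ordered processing: 3, 9, 5 must run as 9, 5, 3.
const queue = [
  { id: 1, priority: 3 },
  { id: 2, priority: 9 },
  { id: 3, priority: 5 },
];

// Claim order: highest priority first, as Test 6.1 asserts.
const order = [...queue].sort((a, b) => b.priority - a.priority).map(t => t.priority);
console.log(order); // [ 9, 5, 3 ]
```

In production this selection would happen per claim (one task at a time), but the resulting order is the same.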

---

### 7. Row-level Locking (1 test)

**Purpose**: Allow horizontal scaling without race conditions.

- **Test 7.1**: Concurrent agents don't claim same task
  - Runs with row-level locking enabled (horizontal-scaling mode)
  - Creates 2 agent instances
  - Both try to claim the same task simultaneously
  - Only one succeeds (count1 + count2 == 1)
  - Task completed exactly once

**Why it matters**: Horizontal scaling allows processing more tasks in parallel. Without row-level locking, agents would duplicate work or corrupt state.
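The claim in Test 7.1 is effectively a compare-and-set: a task only moves from 'pending' to 'running' if it is still pending. This sketch uses a Map as a stand-in for the tasks table; the real suite enforces this at the database level:

```javascript
// Sketch of atomic claiming: exactly one of two agents wins the task.
const tasks = new Map([[42, { status: 'pending', claimed_by: null }]]);

function claim(taskId, agentId) {
  const task = tasks.get(taskId);
  if (!task || task.status !== 'pending') return false; // another agent won
  task.status = 'running';
  task.claimed_by = agentId;
  return true;
}

const results = ['agent-a', 'agent-b'].map(id => claim(42, id));
console.log(results.filter(Boolean).length); // 1, i.e. count1 + count2 == 1
```

The check-then-set must be atomic with respect to other agents; in SQLite that atomicity comes from the database, not from JavaScript.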

---

### 8. Known Error Database (1 test)

**Purpose**: Reuse fixes from similar past errors.

- **Test 8.1**: Similar errors get suggested fixes
  - Creates completed fix_bug task (known fix)
  - Creates new error with same pattern (different line number)
  - Triage detects known fix (similarity >= 70%)
  - Routed task includes suggested fix

**Why it matters**: Reduces LLM calls and fixes errors faster by learning from past successes. 70%+ similarity threshold avoids false matches.
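The 70% threshold can be illustrated with a toy token-overlap measure. The real matcher's algorithm is not specified here; this only shows why "same error, different line number" clears the bar while unrelated errors do not:

```javascript
// Toy similarity: shared tokens over the larger token set.
function similarity(a, b) {
  const tok = s => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tok(a), tb = tok(b);
  const shared = [...ta].filter(t => tb.has(t)).length;
  return shared / Math.max(ta.size, tb.size);
}

const known = "TypeError: cannot read properties of undefined at score.js line 10";
const fresh = "TypeError: cannot read properties of undefined at score.js line 42";

// Only the line number differs, so the overlap is well above 70%.
console.log(similarity(known, fresh) >= 0.7); // true
```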

---

### 9. Coverage Gates (1 test)

**Purpose**: Enforce 85% coverage before commits (hard gate).

- **Test 9.1**: Developer enforces 85% coverage
  - Developer attempts to commit with 70% coverage
  - Commit blocked, task marked 'blocked'
  - QA task created to write tests
  - **Mocks**: file ops, test runner (returns 70% coverage), LLM, git

**Why it matters**: Coverage gates prevent technical debt accumulation. 85% is the threshold for production-quality code.
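The gate's logic reduces to a threshold check on the coverage summary. A sketch, assuming the `total.lines.pct` shape of a coverage-summary.json file (the `gateCommit` helper and its return fields are illustrative):

```javascript
// Sketch of the 85% hard gate: below threshold, block and hand off to QA.
const COVERAGE_THRESHOLD = 85;

function gateCommit(coverageSummary) {
  const pct = coverageSummary.total.lines.pct;
  if (pct < COVERAGE_THRESHOLD) {
    return {
      allowed: false,
      action: 'create_qa_task', // QA writes the missing tests
      reason: `coverage ${pct}% is below the ${COVERAGE_THRESHOLD}% gate`,
    };
  }
  return { allowed: true };
}

console.log(gateCommit({ total: { lines: { pct: 70 } } }).allowed); // false
```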

---

### 10. Workflow Dependencies (2 tests)

**Purpose**: Enforce proper workflows (design → approval → implementation).

- **Test 10.1**: Features require approved design
  - implement_feature task without approved design
  - Should block and auto-create design_proposal task

- **Test 10.2**: Approved design enables implementation
  - Creates approved design_proposal task
  - Feature implementation succeeds (not blocked)
  - **Mocks**: file ops, tests (90% coverage), LLM, git

**Why it matters**: Follows TOGAF/SRE best practices. Design reviews prevent architecture erosion and catch issues before implementation.
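The dependency gate in Tests 10.1/10.2 is a lookup over past tasks. The `feature_id` and `approved` fields are assumed names for illustration; the point is the predicate, not the schema:

```javascript
// Sketch of the design gate: implement_feature needs a completed,
// approved design_proposal for the same feature.
function canImplement(featureId, tasks) {
  return tasks.some(t =>
    t.type === 'design_proposal' &&
    t.feature_id === featureId &&
    t.status === 'completed' &&
    t.approved === true
  );
}

const history = [
  { type: 'design_proposal', feature_id: 'f-1', status: 'completed', approved: true },
];

console.log(canImplement('f-1', history), canImplement('f-2', history)); // true false
```

Test 10.1's behavior follows from the `false` branch: instead of proceeding, the agent blocks the task and auto-creates the missing design_proposal.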

---

### 11. Full Multi-Agent Workflow (1 test)

**Purpose**: Integration test covering Monitor → Triage → Developer → QA.

- **Test 11.1**: Full bug fix workflow
  - Monitor detects error (simulated)
  - Triage classifies and routes
  - Developer fixes bug (mocked)
  - QA task created for verification
  - Verifies complete workflow chain

**Why it matters**: End-to-end validation that all agents work together correctly. This is the closest test to real-world production usage.

---

## Mocking Strategy

### Why Mock?

- **Speed**: Tests run in seconds instead of minutes
- **Reliability**: No external API dependencies
- **Cost**: Zero LLM credits spent on tests
- **Determinism**: Tests don't fail due to external service issues

### What's Mocked?

1. **File Operations** (`file-operations.js`)
   - `readFile()` - Returns mock file content
   - `editFile()` - Returns mock backup path and diff
   - `writeFile()` - Returns mock backup path
   - `getFileContext()` - Returns mock imports/test files

2. **Test Runner** (`test-runner.js`)
   - `runTestsForFile()` - Returns mock test results
   - `runTests()` - Returns mock coverage data
   - Allows simulating pass/fail scenarios and coverage percentages

3. **LLM API** (`agent-claude-api.js`)
   - `simpleLLMCall()` - Returns mock JSON fixes/implementations
   - Avoids hitting OpenRouter API during tests
   - Returns deterministic responses for assertions

4. **Git Commands** (`child_process.execSync`)
   - Returns mock commit hashes
   - Avoids actual git commits during tests

5. **Filesystem** (`fs.readFile` for coverage data)
   - Returns mock coverage-summary.json
   - Allows testing coverage gate logic

### What's NOT Mocked?

- Database operations (uses real SQLite in-memory DB)
- Agent logic (BaseAgent, TriageAgent, DeveloperAgent, QAAgent)
- Task manager (task creation, status transitions)
- Message manager (inter-agent messaging)
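All of the mocks above follow the same swap-and-restore pattern. The suite can use `mock.method()` from `node:test` for this; here is a dependency-free sketch of the idea, with illustrative names (`mockMethod`, `fileOps`):

```javascript
// Swap a method for a fake, use it, then restore the original in afterEach().
function mockMethod(obj, name, impl) {
  const original = obj[name];
  obj[name] = impl;
  return () => { obj[name] = original; }; // restore function for cleanup
}

const fileOps = {
  readFile: () => { throw new Error('real filesystem must not be hit in tests'); },
};

const restore = mockMethod(fileOps, 'readFile', () => 'const x = 1;\n');
const content = fileOps.readFile('src/example.js'); // deterministic mock content
restore(); // afterEach(): put the real method back so tests stay isolated
```

Forgetting the restore step is the usual cause of the "Mock not called" failures described under Troubleshooting.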

---

## Running the Tests

### Full Suite

```bash
npm test tests/agents/e2e-agent-system.test.js
```

### Individual Test Group

```bash
npm test tests/agents/e2e-agent-system.test.js -- --grep "Task Lifecycle"
npm test tests/agents/e2e-agent-system.test.js -- --grep "Inter-agent Communication"
npm test tests/agents/e2e-agent-system.test.js -- --grep "Circuit Breaker"
```

### Debug Mode

```bash
DEBUG=1 npm test tests/agents/e2e-agent-system.test.js
```

---

## Expected Results

### Success Criteria

- All 20 tests pass
- No unhandled promise rejections
- No database connection leaks
- Test database cleaned up after each test

### Coverage Goals

- Agent system core: 80%+ line coverage
- Task lifecycle: 90%+ coverage
- Error handling: 85%+ coverage

---

## Troubleshooting

### Test Failures

**"AssertionError: task not completed"**

- Check if agent initialized correctly (`await agent.initialize()`)
- Verify task has required context fields
- Check agent_logs table for error details

**"Database locked"**

- Ensure afterEach() closes database connections
- Call `resetBaseDb()`, `resetTaskDb()`, `resetMessageDb()`
- Check for concurrent test execution

**"Mock not called"**

- Verify mock.method() is called before agent processes task
- Check if import path matches exactly
- Ensure mock is restored in afterEach()

### Performance Issues

**Tests timeout**

- Increase timeout: `test('name', { timeout: 60000 }, async () => {...})`
- Check for infinite loops in agent logic
- Verify immediate invocation is disabled (`AGENT_IMMEDIATE_INVOCATION=false`)

**Database pollution**

- Ensure each test uses a fresh database
- Verify TEST_DB_PATH is unique per test file
- Check afterEach() cleanup logic

---

## Future Enhancements

### Additional Test Scenarios

1. **Agent Deadlock Detection** - Circular dependencies between agents
2. **Task Timeout Handling** - Tasks stuck in 'running' state for >1 hour
3. **Resource Limits** - Agent system behavior under token budget constraints
4. **Concurrent Workflow Execution** - Multiple bug fixes in parallel
5. **Partial Rollback** - Undo changes when QA rejects fix

### Integration Tests (Real APIs)

- Mark with `.integration.test.js` suffix
- Run separately: `npm run test:integration`
- Use real LLM calls (small budget)
- Validate actual code generation quality

### Load Tests

- Simulate 100+ concurrent tasks
- Verify row-level locking under load
- Measure task throughput (tasks/minute)
- Test database performance at scale

---

## Contributing

### Adding New Tests

1. Follow existing structure (describe → test)
2. Use descriptive test names (what + why)
3. Add mocks for expensive operations
4. Document test purpose in comments
5. Verify cleanup in afterEach()

### Test Organization

- Group related tests in describe() blocks
- Order by complexity (simple → complex)
- Keep test cases independent (no shared state)
- Use setup/teardown for common initialization

### Assertions

- Use specific assertions (`strictEqual` vs `ok`)
- Provide descriptive failure messages
- Test both positive and negative cases
- Verify side effects (logs, messages, child tasks)

---

## References

- Agent Architecture: `/docs/06-automation/agent-system.md`
- Task Manager: `/src/agents/utils/task-manager.js`
- Message Manager: `/src/agents/utils/message-manager.js`
- Base Agent: `/src/agents/base-agent.js`