# Loop detector benchmarks

14-task end-to-end benchmark used to validate agent-loop reliability work
(loop detector gate, force-stop synthesis, empty-result heuristic).

## Scripts

- `driver.sh`: 8 coding scenarios (grep, batch edit, audit analysis, codebase
  comparison, bulk test generation, tracing, tool selection, TODO scan).
  Tasks that modify files (2, 5) run in throwaway worktrees under `/tmp/maxiter_tests/`.
- `driver_tob.sh`: 6 toB daily-task scenarios (calendar, inbox, drive, notion).
  Read-only and draft-free, so there are no side effects on the user's accounts.
- `analyze.py`: post-run parser that reads a session JSON and audit log and emits
  per-task metrics (LLM calls, tool distribution, consecutive streaks, failures,
  max-iter synthesis detection, cost).
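
As a rough illustration of the "consecutive streaks" metric, the sketch below computes the longest back-to-back run per tool from an ordered list of tool names. It is not taken from `analyze.py`: the function name `longest_streaks`, the flat-list input, and the sample tool names are assumptions made for the example, and the real session JSON layout may differ.

```python
from itertools import groupby


def longest_streaks(tool_calls):
    """Return {tool_name: longest run of back-to-back calls to that tool}.

    `tool_calls` is an ordered list of tool names (hypothetical input shape;
    a real parser would first extract these from the session JSON).
    """
    streaks = {}
    # groupby yields one group per run of identical adjacent names.
    for name, run in groupby(tool_calls):
        length = sum(1 for _ in run)
        streaks[name] = max(streaks.get(name, 0), length)
    return streaks


# Hypothetical trajectory: three bash calls in a row, then one more later.
calls = ["think", "bash", "bash", "bash", "file_read", "bash"]
print(longest_streaks(calls))
```

A long streak on its own does not distinguish a spin from a legitimate batch; that depends on whether the arguments repeat.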

## Running

Requires the `shan` binary on `$PATH` and a configured Shannon Cloud endpoint.
Tasks that call Calendar / Gmail / Drive / Notion need MCP servers configured.

```bash
# Run from anywhere: the scripts resolve the repo root via BASH_SOURCE
./test/benchmarks/driver.sh         # coding scenarios
./test/benchmarks/driver_tob.sh     # toB scenarios

# Override where results land
BENCHMARK_RESULTS_DIR=/path/to/out ./test/benchmarks/driver.sh

# Analyze one task
./test/benchmarks/analyze.py <session_id> <task_num> <task_name>
```

Per-task artifacts (stdout, session_id, driver.log) land under
`$BENCHMARK_RESULTS_DIR`, which defaults to `/tmp/maxiter_tests/results` for
`driver.sh` and `/tmp/maxiter_tests/results_tob` for `driver_tob.sh`.

## Expected behavior after Phase 1 gate

The loop detector previously force-stopped Task 5 (coding, ~14 bash calls) and
Task 6 (toB, ~10 MCP calls) despite both being legitimate batch operations with
unique arguments. After the `batchTolerant` uniqueness gate lands:

- Task 5: completes all 19 tool calls without force-stop.
- Task 6: completes all 16 database queries; row counts returned.
- Tasks that don't hit the detector (coding 1, 3, 4, 7, 8; toB 1, 2, 3, 4, 5) must
  remain within ±1 tool call of their pre-fix trajectory. The uniqueness gate must
  not accidentally relax generic `think` / `http` / `file_*` spin detection.
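
The expectations above can be sketched as a small model of a uniqueness gate plus the ±1 regression check. This is an illustration only, not the detector's actual implementation: the function names, the `streak_limit` of 8, the `(tool_name, args)` tuple shape, and treating only `bash` as batch-tolerant are all assumptions chosen for the example.

```python
import json


def trailing_streak(calls):
    """Return the run of calls at the end of `calls` that share one tool name."""
    if not calls:
        return []
    tool = calls[-1][0]
    streak = []
    for name, args in reversed(calls):
        if name != tool:
            break
        streak.append((name, args))
    return streak[::-1]


def should_force_stop(calls, streak_limit=8, batch_tolerant=frozenset({"bash"})):
    """Decide whether a same-tool streak looks like a spin (hypothetical rule).

    A streak on a batch-tolerant tool is allowed to continue as long as every
    call in it carries distinct arguments; any other long streak is stopped.
    """
    streak = trailing_streak(calls)
    if len(streak) < streak_limit:
        return False
    tool = streak[0][0]
    if tool in batch_tolerant:
        # Canonical JSON gives a stable fingerprint for each argument dict.
        fingerprints = {json.dumps(args, sort_keys=True) for _, args in streak}
        if len(fingerprints) == len(streak):
            return False  # every call did distinct work: treat as a batch
    return True


def within_tolerance(before, after, tolerance=1):
    """True if a task's tool-call count moved by at most `tolerance`."""
    return abs(after - before) <= tolerance


# Hypothetical trajectories for illustration.
batch = [("bash", {"cmd": f"pytest tests/test_{i}.py"}) for i in range(14)]
spin = [("bash", {"cmd": "ls"}) for _ in range(14)]
thinking = [("think", {"note": f"step {i}"}) for i in range(14)]
```

In this model the unique-argument `bash` batch passes, the repeated `ls` spin is stopped, and the `think` streak is still stopped even though its arguments differ, because `think` is not in the batch-tolerant set. That last case is the property the third bullet guards.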