README.md
# Loop detector benchmarks

14-task end-to-end benchmark used to validate agent-loop reliability work
(loop detector gate, force-stop synthesis, empty-result heuristic).

## Scripts

- `driver.sh`: 8 coding scenarios (grep, batch edit, audit analysis, codebase
  comparison, bulk test generation, tracing, tool selection, TODO scan).
  Modifying tasks (2, 5) run in throwaway worktrees under `/tmp/maxiter_tests/`.
- `driver_tob.sh`: 6 toB daily-task scenarios (calendar, inbox, drive, notion).
  Read-only / draft-free: no side effects on the user's accounts.
- `analyze.py`: post-run parser that reads a session JSON + audit log and emits
  per-task metrics (LLM calls, tool distribution, consecutive streaks, failures,
  max-iter synthesis detection, cost).

## Running

Requires the `shan` binary on `$PATH` and a configured Shannon Cloud endpoint.
Tasks that call Calendar / Gmail / Drive / Notion need MCP servers configured.

```bash
# From anywhere; scripts resolve repo root via BASH_SOURCE
./test/benchmarks/driver.sh        # coding scenarios
./test/benchmarks/driver_tob.sh    # toB scenarios

# Override where results land
BENCHMARK_RESULTS_DIR=/path/to/out ./test/benchmarks/driver.sh

# Analyze one task
./test/benchmarks/analyze.py <session_id> <task_num> <task_name>
```

Per-task artifacts (stdout, session_id, driver.log) land under
`$BENCHMARK_RESULTS_DIR` (defaults to `/tmp/maxiter_tests/results` and
`/tmp/maxiter_tests/results_tob`).

## Expected behavior after Phase 1 gate

The loop detector previously force-stopped Task 5 (coding, ~14 bash calls) and
Task 6 (toB, ~10 MCP calls) despite both being legitimate batch operations with
unique arguments. After the `batchTolerant` uniqueness gate lands:

- Task 5: completes all 19 tool calls without force-stop.
- Task 6: completes all 16 database queries; row counts returned.
- Tasks that don't hit the detector (coding 1, 3, 4, 7, 8; toB 1, 2, 3, 4, 5) must
  remain within ±1 tool call of their pre-fix trajectory: the uniqueness gate must
  not accidentally relax generic `think` / `http` / `file_*` spin detection.
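
The behavior described above can be illustrated with a minimal Python sketch of a uniqueness gate. This is not the actual `batchTolerant` implementation, which this README does not show. The function names, the `streak_limit` threshold, and the contents of the `batch_tolerant` set are all assumptions made for illustration. The idea is that a long streak of calls to the same tool is treated as a loop unless the tool is marked batch-tolerant and every call in the streak has distinct arguments.

```python
import json
from typing import Any, Dict, FrozenSet, Iterable, Tuple

Call = Tuple[str, Dict[str, Any]]  # (tool name, arguments); hypothetical shape


def call_key(tool: str, args: Dict[str, Any]) -> str:
    """Fingerprint one call: tool name plus its arguments as sorted-key JSON."""
    return tool + ":" + json.dumps(args, sort_keys=True, default=str)


def should_force_stop(
    calls: Iterable[Call],
    streak_limit: int = 8,  # assumed threshold, not taken from the real detector
    batch_tolerant: FrozenSet[str] = frozenset({"bash"}),  # assumed tool set
) -> bool:
    """Return True when the trailing same-tool streak looks like a loop."""
    history = list(calls)
    if not history:
        return False

    # Collect fingerprints for the trailing run of calls to the same tool.
    last_tool = history[-1][0]
    streak = []
    for tool, args in reversed(history):
        if tool != last_tool:
            break
        streak.append(call_key(tool, args))

    if len(streak) < streak_limit:
        return False

    # Uniqueness gate: a batch-tolerant tool whose streak has all-distinct
    # arguments is treated as a legitimate batch operation, not a loop.
    if last_tool in batch_tolerant and len(set(streak)) == len(streak):
        return False

    # Anything else (repeated args, or a tool outside the set) still trips.
    return True


if __name__ == "__main__":
    batch = [("bash", {"cmd": f"run-step {i}"}) for i in range(19)]
    spin = [("bash", {"cmd": "ls"})] * 19
    print(should_force_stop(batch), should_force_stop(spin))
```

In this sketch, a tool outside `batch_tolerant` (for example `think`) still trips the detector even when its arguments differ on every call, which mirrors the requirement that the gate must not relax generic spin detection.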