2026-04-02-browse-skill-testing-design.md
1 # Browse Skill Testing Design 2 3 Two-layer testing framework for `opencli browse` commands and the 4 Claude Code skill integration. 5 6 ## Goal 7 8 Verify that `opencli browse` works reliably on real websites and that 9 Claude Code can use the skill to complete browser tasks end-to-end. 10 11 ## Architecture 12 13 ``` 14 autoresearch/ 15 ├── browse-tasks.json ← 59 task definitions with browse command sequences 16 ├── eval-browse.ts ← Layer 1: deterministic browse command testing 17 ├── eval-skill.ts ← Layer 2: Claude Code skill E2E testing 18 ├── run-browse.sh ← Launch Layer 1 19 ├── run-skill.sh ← Launch Layer 2 20 ├── baseline-browse.txt ← Layer 1 best score 21 ├── baseline-skill.txt ← Layer 2 best score 22 └── results/ ← Per-run results (gitignored) 23 ``` 24 25 ## Layer 1: Deterministic Browse Command Testing 26 27 Tests `opencli browse` commands directly on real websites. No LLM 28 involved — pure command reliability testing. 29 30 ### How It Works 31 32 Each task defines a sequence of browse commands and a judge for the 33 last command's output: 34 35 ```json 36 { 37 "name": "hn-top-stories", 38 "steps": [ 39 "opencli browse open https://news.ycombinator.com", 40 "opencli browse eval \"JSON.stringify([...document.querySelectorAll('.titleline a')].slice(0,5).map(a=>({title:a.textContent,url:a.href})))\"" 41 ], 42 "judge": { "type": "arrayMinLength", "minLength": 5 } 43 } 44 ``` 45 46 ### Execution 47 48 ```bash 49 ./autoresearch/run-browse.sh 50 ``` 51 52 - Runs all 59 tasks serially 53 - Each task: execute steps → judge last step output → pass/fail 54 - `opencli browse close` between tasks for clean state 55 - Expected: ~2 minutes, $0 cost 56 57 ### Task Categories 58 59 | Category | Count | Example | 60 |----------|-------|---------| 61 | extract | 9 | Open page, eval JS to extract data | 62 | list | 10 | Open page, eval JS to extract array | 63 | search | 6 | Open, type query, keys Enter, eval results | 64 | nav | 7 | Open, click link, eval new page title | 65 | scroll | 5 | Open, scroll, eval footer/hidden content | 66 | form | 6 | Open, type into fields, eval field values | 67 | complex | 6 | Multi-step: open → click → navigate → extract | 68 | bench | 10 | Test set (various) | 69 70 ## Layer 2: Claude Code Skill E2E Testing 71 72 Spawns Claude Code with the opencli-browser skill to complete tasks 73 autonomously using browse commands. 74 75 ### How It Works 76 77 ```bash 78 claude -p \ 79 --system-prompt "$(cat skills/opencli-browser/SKILL.md)" \ 80 --dangerously-skip-permissions \ 81 --allowedTools "Bash(opencli:*)" \ 82 --output-format json \ 83 "用 opencli browse 完成任务:Extract the top 5 stories from Hacker News with title and score. Start URL: https://news.ycombinator.com" 84 ``` 85 86 ### Execution 87 88 ```bash 89 ./autoresearch/run-skill.sh 90 ``` 91 92 - Runs all 59 tasks serially 93 - Each task: spawn Claude Code → it uses browse commands autonomously → judge output 94 - Expected: ~20 minutes, ~$5-10 95 96 ### Judge 97 98 Both layers use the same judge types: 99 100 | Type | Description | 101 |------|-------------| 102 | `contains` | Output contains a substring | 103 | `arrayMinLength` | Output is an array with ≥ N items | 104 | `arrayFieldsPresent` | Array items have required fields | 105 | `nonEmpty` | Output is non-empty | 106 | `matchesPattern` | Output matches a regex | 107 108 ## Output Format 109 110 ``` 111 🔬 Layer 1: Browse Commands — 59 tasks 112 113 [1/59] extract-title-example... ✓ (0.5s) 114 [2/59] hn-top-stories... ✓ (1.2s) 115 ... 116 117 Score: 55/59 (93%) 118 Time: 2min 119 Cost: $0 120 121 🔬 Layer 2: Skill E2E — 59 tasks 122 123 [1/59] extract-title-example... ✓ (8s, $0.01) 124 [2/59] hn-top-stories... ✓ (15s, $0.08) 125 ... 126 127 Score: 52/59 (88%) 128 Time: 20min 129 Cost: $6.50 130 ``` 131 132 ## Constraints 133 134 - All 59 tasks run on real websites (no mocks) 135 - Layer 1: zero LLM cost, ~2 min 136 - Layer 2: ~$5-10 LLM cost, ~20 min 137 - Results saved to `autoresearch/results/` (gitignored) 138 - Baselines tracked in `baseline-browse.txt` and `baseline-skill.txt` 139 140 ## Success Criteria 141 142 - Layer 1 ≥ 90% (browse commands work on real sites) 143 - Layer 2 ≥ 85% (Claude Code can use skill effectively) 144 - Both layers cover all 8 task categories