Cradicle Explorer

/ docs / superpowers / specs / 2026-04-02-browse-skill-testing-design.md
2026-04-02-browse-skill-testing-design.md
  1  # Browse Skill Testing Design
  2  
  3  Two-layer testing framework for `opencli browse` commands and the
  4  Claude Code skill integration.
  5  
  6  ## Goal
  7  
  8  Verify that `opencli browse` works reliably on real websites and that
  9  Claude Code can use the skill to complete browser tasks end-to-end.
 10  
 11  ## Architecture
 12  
 13  ```
 14  autoresearch/
 15  ├── browse-tasks.json       ← 59 task definitions with browse command sequences
 16  ├── eval-browse.ts          ← Layer 1: deterministic browse command testing
 17  ├── eval-skill.ts           ← Layer 2: Claude Code skill E2E testing
 18  ├── run-browse.sh           ← Launch Layer 1
 19  ├── run-skill.sh            ← Launch Layer 2
 20  ├── baseline-browse.txt     ← Layer 1 best score
 21  ├── baseline-skill.txt      ← Layer 2 best score
 22  └── results/                ← Per-run results (gitignored)
 23  ```
 24  
 25  ## Layer 1: Deterministic Browse Command Testing
 26  
 27  Tests `opencli browse` commands directly on real websites. No LLM
 28  involved — pure command reliability testing.
 29  
 30  ### How It Works
 31  
 32  Each task defines a sequence of browse commands and a judge for the
 33  last command's output:
 34  
 35  ```json
 36  {
 37    "name": "hn-top-stories",
 38    "steps": [
 39      "opencli browse open https://news.ycombinator.com",
 40      "opencli browse eval \"JSON.stringify([...document.querySelectorAll('.titleline a')].slice(0,5).map(a=>({title:a.textContent,url:a.href})))\""
 41    ],
 42    "judge": { "type": "arrayMinLength", "minLength": 5 }
 43  }
 44  ```
 45  
 46  ### Execution
 47  
 48  ```bash
 49  ./autoresearch/run-browse.sh
 50  ```
 51  
 52  - Runs all 59 tasks serially
 53  - Each task: execute steps → judge last step output → pass/fail
 54  - `opencli browse close` between tasks for clean state
 55  - Expected: ~2 minutes, $0 cost
 56  
 57  ### Task Categories
 58  
 59  | Category | Count | Example |
 60  |----------|-------|---------|
 61  | extract | 9 | Open page, eval JS to extract data |
 62  | list | 10 | Open page, eval JS to extract array |
 63  | search | 6 | Open, type query, keys Enter, eval results |
 64  | nav | 7 | Open, click link, eval new page title |
 65  | scroll | 5 | Open, scroll, eval footer/hidden content |
 66  | form | 6 | Open, type into fields, eval field values |
 67  | complex | 6 | Multi-step: open → click → navigate → extract |
 68  | bench | 10 | Test set (various) |
 69  
 70  ## Layer 2: Claude Code Skill E2E Testing
 71  
 72  Spawns Claude Code with the opencli-browser skill to complete tasks
 73  autonomously using browse commands.
 74  
 75  ### How It Works
 76  
 77  ```bash
 78  claude -p \
 79    --system-prompt "$(cat skills/opencli-browser/SKILL.md)" \
 80    --dangerously-skip-permissions \
 81    --allowedTools "Bash(opencli:*)" \
 82    --output-format json \
 83    "用 opencli browse 完成任务：Extract the top 5 stories from Hacker News with title and score. Start URL: https://news.ycombinator.com"
 84  ```
 85  
 86  ### Execution
 87  
 88  ```bash
 89  ./autoresearch/run-skill.sh
 90  ```
 91  
 92  - Runs all 59 tasks serially
 93  - Each task: spawn Claude Code → it uses browse commands autonomously → judge output
 94  - Expected: ~20 minutes, ~$5-10
 95  
 96  ### Judge
 97  
 98  Both layers use the same judge types:
 99  
100  | Type | Description |
101  |------|-------------|
102  | `contains` | Output contains a substring |
103  | `arrayMinLength` | Output is an array with ≥ N items |
104  | `arrayFieldsPresent` | Array items have required fields |
105  | `nonEmpty` | Output is non-empty |
106  | `matchesPattern` | Output matches a regex |
107  
108  ## Output Format
109  
110  ```
111  🔬 Layer 1: Browse Commands — 59 tasks
112  
113    [1/59] extract-title-example... ✓ (0.5s)
114    [2/59] hn-top-stories... ✓ (1.2s)
115    ...
116  
117    Score: 55/59 (93%)
118    Time: 2min
119    Cost: $0
120  
121  🔬 Layer 2: Skill E2E — 59 tasks
122  
123    [1/59] extract-title-example... ✓ (8s, $0.01)
124    [2/59] hn-top-stories... ✓ (15s, $0.08)
125    ...
126  
127    Score: 52/59 (88%)
128    Time: 20min
129    Cost: $6.50
130  ```
131  
132  ## Constraints
133  
134  - All 59 tasks run on real websites (no mocks)
135  - Layer 1: zero LLM cost, ~2 min
136  - Layer 2: ~$5-10 LLM cost, ~20 min
137  - Results saved to `autoresearch/results/` (gitignored)
138  - Baselines tracked in `baseline-browse.txt` and `baseline-skill.txt`
139  
140  ## Success Criteria
141  
142  - Layer 1 ≥ 90% (browse commands work on real sites)
143  - Layer 2 ≥ 85% (Claude Code can use skill effectively)
144  - Both layers cover all 8 task categories