# Browser CDP Supervisor — Design

**Status:** Shipped (PR 14540)
**Last updated:** 2026-04-23
**Author:** @teknium1

## Problem

Native JS dialogs (`alert`/`confirm`/`prompt`/`beforeunload`) and iframes are
the two biggest gaps in our browser tooling:

1. **Dialogs block the JS thread.** Any operation on the page stalls until the
   dialog is handled. Before this work, the agent had no way to know a dialog
   was open — subsequent tool calls would hang or throw opaque errors.
2. **Iframes are invisible.** The agent could see iframe nodes in the DOM
   snapshot but could not click, type, or eval inside them — especially
   cross-origin (OOPIF) iframes that live in separate Chromium processes.

[PR #12550](https://github.com/NousResearch/hermes-agent/pull/12550) proposed a
stateless `browser_dialog` wrapper. That doesn't solve detection — it's a
cleaner CDP call for when the agent already knows (via symptoms) that a dialog
is open. Closed as superseded.

## Backend capability matrix (verified live 2026-04-23)

Using throwaway probe scripts against a data-URL page that fires alerts in the
main frame and in a same-origin srcdoc iframe, plus a cross-origin
`https://example.com` iframe:

| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF `Runtime.evaluate` via `browser_cdp(frame_id=...)` |
|---|---|---|---|---|
| Local Chrome (`--remote-debugging-port`) / `/browser connect` | ✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (`document.title = "Example Domain"` verified on real cross-origin iframe) |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |

**How Browserbase respond works.** Browserbase's CDP proxy uses Playwright
internally and auto-dismisses native dialogs within ~10ms, so
`Page.handleJavaScriptDialog` can't keep up.
To work around this, the supervisor injects a bridge script via
`Page.addScriptToEvaluateOnNewDocument` that overrides
`window.alert`/`confirm`/`prompt` with a synchronous XHR to a magic host
(`hermes-dialog-bridge.invalid`). `Fetch.enable` intercepts those XHRs
before they touch the network — the dialog becomes a `Fetch.requestPaused`
event the supervisor captures, and `respond_to_dialog` fulfills via
`Fetch.fulfillRequest` with a JSON body the injected script decodes.

Net result: from the page's perspective, `prompt()` still returns the
agent-supplied string. From the agent's perspective, it's the same
`browser_dialog(action=...)` API either way. Tested end-to-end against
real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss)
pass including value round-tripping back into page JS.

Camofox stays unsupported for this PR; follow-up upstream issue planned at
`jo-inc/camofox-browser` requesting a dialog polling endpoint.

## Architecture

### CDPSupervisor

One `asyncio.Task` running in a background daemon thread per Hermes `task_id`.
Holds a persistent WebSocket to the backend's CDP endpoint.
Maintains:

- **Dialog queue** — `List[PendingDialog]` with `{id, type, message, default_prompt, session_id, opened_at}`
- **Frame tree** — `Dict[frame_id, FrameInfo]` with parent relationships, URL, origin, whether cross-origin child session
- **Session map** — `Dict[session_id, SessionInfo]` so interaction tools can route to the right attached session for OOPIF operations
- **Recent console errors** — ring buffer of the last 50 (for PR 2 diagnostics)

Subscribes on attach:
- `Page.enable` — `javascriptDialogOpening`, `frameAttached`, `frameNavigated`, `frameDetached`
- `Runtime.enable` — `executionContextCreated`, `consoleAPICalled`, `exceptionThrown`
- `Target.setAutoAttach {autoAttach: true, flatten: true}` — surfaces child OOPIF targets; supervisor enables `Page`+`Runtime` on each

Thread-safe state access via a snapshot lock; tool handlers (sync) read the
frozen snapshot without awaiting.

### Lifecycle

- **Start:** `SupervisorRegistry.get_or_start(task_id, cdp_url)` — called by
  `browser_navigate`, Browserbase session create, `/browser connect`. Idempotent.
- **Stop:** session teardown or `/browser disconnect`. Cancels the asyncio
  task, closes the WebSocket, discards state.
- **Rebind:** if the CDP URL changes (user reconnects to a new Chrome), stop
  the old supervisor and start fresh — never reuse state across endpoints.

### Dialog policy

Configurable via `config.yaml` under `browser.dialog_policy`:

- **`must_respond`** (default) — capture, surface in `browser_snapshot`, wait
  for explicit `browser_dialog(action=...)` call. After a 300s safety timeout
  with no response, auto-dismiss and log. Prevents a buggy agent from stalling
  forever.
- `auto_dismiss` — record and dismiss immediately; agent sees it after the
  fact via `browser_state` inside `browser_snapshot`.
- `auto_accept` — record and accept (useful for `beforeunload` where the user
  wants to navigate away cleanly).

Policy is per-task; no per-dialog overrides in v1.

## Agent surface (PR 1)

### One new tool

```
browser_dialog(action, prompt_text=None, dialog_id=None)
```

- `action="accept"` / `"dismiss"` → responds to the specified or sole pending dialog (required)
- `prompt_text=...` → text to supply to a `prompt()` dialog
- `dialog_id=...` → disambiguate when multiple dialogs queued (rare)

Tool is response-only. Agent reads pending dialogs from `browser_snapshot`
output before calling.

### `browser_snapshot` extension

Adds three optional fields to the existing snapshot output when a supervisor
is attached:

```json
{
  "pending_dialogs": [
    {"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
  ],
  "recent_dialogs": [
    {"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
     "closed_at": 1650000000.1, "closed_by": "remote"}
  ],
  "frame_tree": {
    "top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
    "children": [
      {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
      {"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
    ],
    "truncated": false
  }
}
```

- **`pending_dialogs`**: dialogs currently blocking the page's JS thread.
  The agent must call `browser_dialog(action=...)` to respond. Empty on
  Browserbase because their CDP proxy auto-dismisses within ~10ms.

- **`recent_dialogs`**: ring buffer of up to 20 recently-closed dialogs with
  a `closed_by` tag — `"agent"` (we responded), `"auto_policy"` (local
  auto_dismiss/auto_accept), `"watchdog"` (must_respond timeout hit), or
  `"remote"` (browser/backend closed it on us, e.g. Browserbase).
  This is how agents on Browserbase still get visibility into what happened.

- **`frame_tree`**: frame structure including cross-origin (OOPIF) children.
  Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy
  pages. `truncated: true` surfaces when limits were hit; agents needing
  the full tree can use `browser_cdp` with `Page.getFrameTree`.

No new tool schema surface for any of these — the agent reads the snapshot
it already requests.

### Availability gating

Both surfaces gate on `_browser_cdp_check` (supervisor can only run when a CDP
endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is
hidden and snapshot omits the new fields — no schema bloat.

## Cross-origin iframe interaction

Extending the dialog-detect work, `browser_cdp(frame_id=...)` routes CDP
calls (notably `Runtime.evaluate`) through the supervisor's already-connected
WebSocket using the OOPIF's child `sessionId`. Agents pick frame_ids out of
`browser_snapshot.frame_tree.children[]` where `is_oopif=true` and pass them
to `browser_cdp`. For same-origin iframes (no dedicated CDP session), the
agent uses `contentWindow`/`contentDocument` from a top-level
`Runtime.evaluate` instead — supervisor surfaces an error pointing at that
fallback when `frame_id` belongs to a non-OOPIF.

On Browserbase, this is the ONLY reliable path for iframe interaction —
stateless CDP connections (opened per `browser_cdp` call) hit signed-URL
expiry, while the supervisor's long-lived connection keeps a valid session.
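
The routing rule in this section can be sketched as a small lookup over the `frame_tree` snapshot. This is a minimal illustration under stated assumptions, not the supervisor's actual code: `resolve_frame_session` and `FrameRoutingError` are hypothetical names, and the input shape is assumed to match the `frame_tree` example shown in the snapshot extension.

```python
from typing import Any, Dict


class FrameRoutingError(Exception):
    """Hypothetical error for a frame_id that cannot be routed to a child session."""


def resolve_frame_session(frame_tree: Dict[str, Any], frame_id: str) -> str:
    """Return the child CDP session id for an OOPIF frame (illustrative only)."""
    for child in frame_tree.get("children", []):
        if child.get("frame_id") != frame_id:
            continue
        if child.get("is_oopif") and child.get("session_id"):
            return child["session_id"]
        # Same-origin iframes have no dedicated CDP session, so the error
        # points the caller at the top-level Runtime.evaluate fallback.
        raise FrameRoutingError(
            f"{frame_id} is not an OOPIF; use contentWindow/contentDocument "
            "from a top-level Runtime.evaluate instead"
        )
    raise FrameRoutingError(f"unknown frame_id: {frame_id}")


# Sample data copied from the frame_tree example in the snapshot extension.
frame_tree = {
    "top": {"frame_id": "FRAME_A", "url": "https://example.com/"},
    "children": [
        {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": False},
        {"frame_id": "FRAME_C", "url": "https://ads.example.net/",
         "is_oopif": True, "session_id": "SID_C"},
    ],
    "truncated": False,
}
```

With this sample data, `resolve_frame_session(frame_tree, "FRAME_C")` yields the child session id to route through, while `FRAME_B` raises the error that names the same-origin fallback.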

## Camofox (follow-up)

Issue planned against `jo-inc/camofox-browser` adding:
- Playwright `page.on('dialog', handler)` per session
- `GET /tabs/:tabId/dialogs` polling endpoint
- `POST /tabs/:tabId/dialogs/:id` to accept/dismiss
- Frame-tree introspection endpoint

## Files touched (PR 1)

### New

- `tools/browser_supervisor.py` — `CDPSupervisor`, `SupervisorRegistry`, `PendingDialog`, `FrameInfo`
- `tools/browser_dialog_tool.py` — `browser_dialog` tool handler
- `tests/tools/test_browser_supervisor.py` — mock CDP WebSocket server + lifecycle/state tests
- `website/docs/developer-guide/browser-supervisor.md` — this file

### Modified

- `toolsets.py` — register `browser_dialog` in `browser`, `hermes-acp`, `hermes-api-server`, core toolsets (gated on CDP reachability)
- `tools/browser_tool.py`
  - `browser_navigate` start-hook: if CDP URL resolvable, `SupervisorRegistry.get_or_start(task_id, cdp_url)`
  - `browser_snapshot` (at ~line 1536): merge supervisor state into return payload
  - `/browser connect` handler: restart supervisor with new endpoint
  - Session teardown hooks in `_cleanup_browser_session`
- `hermes_cli/config.py` — add `browser.dialog_policy` and `browser.dialog_timeout_s` to `DEFAULT_CONFIG`
- Docs: `website/docs/user-guide/features/browser.md`, `website/docs/reference/tools-reference.md`, `website/docs/reference/toolsets-reference.md`

## Non-goals

- Detection/interaction for Camofox (upstream gap; tracked separately)
- Streaming dialog/frame events live to the user (would require gateway hooks)
- Persisting dialog history across sessions (in-memory only)
- Per-iframe dialog policies (agent can express this via `dialog_id`)
- Replacing `browser_cdp` — it stays as the escape hatch for the long tail (cookies, viewport, network throttling)

## Testing

Unit tests use an asyncio mock CDP server that speaks
enough of the protocol to exercise all state transitions: attach, enable,
navigate, dialog fire, dialog dismiss, frame attach/detach, child target
attach, session teardown. Real-backend E2E (Browserbase + local Chrome) is
manual; probe scripts from the 2026-04-23 investigation are kept in-repo under
`scripts/browser_supervisor_e2e.py` so anyone can re-verify on new backend
versions.
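
To illustrate the dialog fire and dismiss transitions such tests cover, here is a minimal sketch that feeds scripted CDP events through an `asyncio.Queue` in place of a real WebSocket. The `DialogTracker` class and the `feed` and `run_scenario` helpers are hypothetical and are not the classes in `tests/tools/test_browser_supervisor.py`. Only the event names `Page.javascriptDialogOpening` and `Page.javascriptDialogClosed` come from the CDP `Page` domain.

```python
import asyncio
import json
from typing import Any, Dict, List, Optional


class DialogTracker:
    """Hypothetical stand-in for the supervisor's dialog bookkeeping."""

    def __init__(self) -> None:
        self.pending: List[Dict[str, str]] = []
        self.recent: List[Dict[str, str]] = []
        self._next_id = 1

    def handle(self, event: Dict[str, Any]) -> None:
        method = event.get("method")
        params = event.get("params", {})
        if method == "Page.javascriptDialogOpening":
            self.pending.append({
                "id": f"d-{self._next_id}",
                "type": params.get("type", ""),
                "message": params.get("message", ""),
            })
            self._next_id += 1
        elif method == "Page.javascriptDialogClosed" and self.pending:
            # Assumption for this sketch: a close we did not initiate is
            # tagged "remote", mirroring the closed_by values described above.
            closed = self.pending.pop(0)
            closed["closed_by"] = "remote"
            self.recent.append(closed)


async def feed(tracker: DialogTracker, frames: List[str]) -> None:
    """Push scripted JSON frames through a queue, as a mock server would over a socket."""
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    for frame in frames:
        queue.put_nowait(frame)
    queue.put_nowait(None)  # sentinel standing in for "connection closed"
    while (raw := await queue.get()) is not None:
        tracker.handle(json.loads(raw))


def run_scenario() -> DialogTracker:
    """Replay one alert that opens and is then closed by the backend."""
    tracker = DialogTracker()
    frames = [
        json.dumps({"method": "Page.javascriptDialogOpening",
                    "params": {"type": "alert", "message": "Hello"}}),
        json.dumps({"method": "Page.javascriptDialogClosed",
                    "params": {"result": True, "userInput": ""}}),
    ]
    asyncio.run(feed(tracker, frames))
    return tracker
```

After `run_scenario()`, the tracker's `pending` list is empty and `recent` holds one entry tagged `closed_by: "remote"`. A queue keeps the sketch free of third-party WebSocket dependencies, whereas the real tests, per the file list above, use a mock CDP WebSocket server.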