Building an Observable Browser Sandbox

Running a browser in a Docker container is easy. Making it observable, so that a human can watch it in real time, the agent can take screenshots, and test runs can be recorded as video, turns out to require a carefully layered stack.

We built the Hawkeye sandbox before we built the agent. The reasoning was simple: without an environment to run in, the agent has nothing to act on. And without observability, debugging agent behavior would be guesswork.

The Five-Layer Stack

The sandbox container runs five services simultaneously. Four of them make up the live-viewing path. A virtual display renders the browser to memory rather than to a physical screen. A VNC server exposes that virtual display over the network. A WebSocket bridge makes the VNC stream reachable from any browser tab. A noVNC viewer lets you watch the test run live in that tab, without installing any client software.

The fifth service is a protocol proxy that exposes the browser's internal debugging interface to the outside world. This is how the agent captures screenshots, monitors network requests, and reads console output, all through a separate channel from the browser actions themselves.

When recording is enabled, a video encoder runs in addition to those five. It captures the virtual display continuously and encodes it to a standard video format that can be reviewed after the run.
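
As a rough sketch of how such a stack might be wired together, the following builds one command line per service. The tool choices (Xvfb, x11vnc, websockify, socat, ffmpeg), the ports, the display number, and the file paths are all illustrative assumptions, not a description of Hawkeye's actual configuration. In this sketch the noVNC viewer is not a separate process: its static files are served by the WebSocket bridge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxConfig:
    # Every default here is a hypothetical value chosen for illustration.
    display: str = ":99"
    width: int = 1280
    height: int = 720
    vnc_port: int = 5900
    web_port: int = 6080
    browser_debug_port: int = 9222
    proxy_port: int = 9223
    novnc_dir: str = "/usr/share/novnc"
    record: bool = False
    video_path: str = "/tmp/run.mp4"


def build_commands(cfg: SandboxConfig) -> dict[str, list[str]]:
    """Return one argv list per service, keyed by role, in start order."""
    size = f"{cfg.width}x{cfg.height}"
    commands = {
        # Virtual display: renders to memory instead of a physical screen.
        "display": ["Xvfb", cfg.display, "-screen", "0", f"{size}x24"],
        # VNC server: exposes the virtual display over the network.
        "vnc": ["x11vnc", "-display", cfg.display,
                "-rfbport", str(cfg.vnc_port), "-forever", "-shared", "-nopw"],
        # WebSocket bridge, also serving the noVNC viewer's static files.
        "bridge": ["websockify", "--web", cfg.novnc_dir,
                   str(cfg.web_port), f"localhost:{cfg.vnc_port}"],
        # Proxy: forwards outside connections to the browser's debug port.
        "proxy": ["socat", f"TCP-LISTEN:{cfg.proxy_port},fork,reuseaddr",
                  f"TCP:127.0.0.1:{cfg.browser_debug_port}"],
    }
    if cfg.record:
        # Video encoder: grabs the virtual display and writes an MP4.
        commands["recorder"] = [
            "ffmpeg", "-y", "-f", "x11grab", "-video_size", size,
            "-framerate", "15", "-i", cfg.display,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", cfg.video_path,
        ]
    return commands
```

Keeping the commands as data makes the start order explicit and lets the recorder be switched on per run without touching the other services.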

Two Protocols, One Browser

The most interesting architectural decision was using two separate protocols to communicate with the same browser instance.

The first protocol handles actions: clicking, typing, navigating, reading the page structure. It is abstracted above the individual browser engine, which means the same agent code works with Chromium, Firefox, and WebKit without modification.

The second protocol handles observation: screenshots, network traffic, console messages, JavaScript evaluation. It connects to the browser's internal debugging interface directly.

The separation exists because neither protocol does everything. The action protocol is deliberately limited — it exposes the interactions a user would perform, not internal browser data. The debugging protocol fills those gaps.
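
To make the observation side concrete, here is a minimal sketch that assumes the debugging interface is the Chrome DevTools Protocol on a Chromium browser (the text above does not name the protocol, so treat that as an assumption). It only builds and parses the JSON messages; sending them over the WebSocket is left out. The helper names are hypothetical.

```python
import base64
import itertools
import json

# Each command needs a unique id so responses can be matched to requests.
_ids = itertools.count(1)


def cdp_command(method: str, **params: object) -> str:
    """Serialize one debugging-protocol command as a JSON text frame."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def observation_setup() -> list[str]:
    """Commands that switch on the event streams the agent observes."""
    return [
        cdp_command("Network.enable"),  # network request and response events
        cdp_command("Runtime.enable"),  # console output events
    ]


def screenshot_command() -> str:
    """Command asking the browser for a PNG of the current page."""
    return cdp_command("Page.captureScreenshot", format="png")


def decode_screenshot(response_text: str) -> bytes:
    """Extract the image bytes from a Page.captureScreenshot response."""
    payload = json.loads(response_text)
    return base64.b64decode(payload["result"]["data"])
```

None of these messages click, type, or navigate. That stays with the action protocol, which is the point of the split.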

The timing between them matters. During test setup, the agent navigates to the target page through the action protocol first, then connects the debugging session to the correct tab. Get that order wrong and the debugging session attaches to a blank tab.
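
One way to guard against the wrong order is to check which tab the debugging session is about to attach to. The sketch below assumes the debugging endpoint can list open targets as dictionaries with "type" and "url" keys, as Chromium's does; the function and exception names are hypothetical.

```python
from typing import Iterable, Mapping


class NoMatchingTab(RuntimeError):
    """Raised when no open tab is showing the page under test."""


def pick_debug_target(
    targets: Iterable[Mapping[str, str]], expected_url: str
) -> Mapping[str, str]:
    """Choose the tab the debugging session should attach to.

    Only targets of type "page" are considered, and the tab's URL must
    start with `expected_url`. A lone about:blank tab therefore fails
    loudly instead of being attached to silently.
    """
    pages = [t for t in targets if t.get("type") == "page"]
    for target in pages:
        if target.get("url", "").startswith(expected_url):
            return target
    seen = [t.get("url", "") for t in pages]
    raise NoMatchingTab(f"no tab at {expected_url!r}; open tabs: {seen}")
```

Called after navigation, this returns the right tab. Called before it, the only page is blank and the error makes the ordering mistake visible straight away.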

Why Live Viewing Matters

The noVNC live view turned out to be one of the most useful features we built, for reasons we didn't fully anticipate.

The obvious use is debugging: when a test is failing, you can watch the run and see exactly what the agent saw and why it made the decisions it made. But the more valuable use is trust-building. When a stakeholder can watch an AI agent navigate their application in real time, clicking buttons, filling forms, and scrolling through content, the abstract concept of "AI testing" becomes concrete and legible. The live view is how we explain what Hawkeye actually does.

Container Isolation

Each test run gets its own container. There's no shared state between runs, no cookie persistence across tests, no risk of one test's actions affecting another. When the run completes, the container is destroyed and everything inside it disappears.

This isolation is what makes parallel execution safe. Two tests can run simultaneously against the same application without interfering with each other. To hide the cost of starting a container, typically ten to fifteen seconds, the container pool pre-warms containers so that a run served from the pool starts without that delay.
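
The pool logic can be sketched in a few lines. This is a simplified, single-threaded illustration with hypothetical names; the `spawn` and `destroy` callables stand in for whatever actually creates and removes containers, and a real pool would refill in the background and need locking.

```python
import itertools
from collections import deque
from typing import Callable, Deque


class ContainerPool:
    """Keeps `size` containers warm; each one serves exactly one run."""

    def __init__(
        self,
        spawn: Callable[[], str],
        destroy: Callable[[str], None],
        size: int,
    ) -> None:
        self._spawn = spawn
        self._destroy = destroy
        self._size = size
        self._warm: Deque[str] = deque()
        self.refill()

    def refill(self) -> None:
        """Spawn containers until the pool is back at its target size."""
        while len(self._warm) < self._size:
            self._warm.append(self._spawn())

    def acquire(self) -> str:
        """Hand out a warm container, or pay the spawn cost if none is left."""
        if self._warm:
            return self._warm.popleft()
        return self._spawn()

    def release(self, container_id: str) -> None:
        """Destroy the used container (it is never reused), then top up."""
        self._destroy(container_id)
        self.refill()


# Usage with stand-in spawn and destroy functions.
counter = itertools.count(1)
destroyed: list[str] = []
pool = ContainerPool(lambda: f"c{next(counter)}", destroyed.append, size=2)
first = pool.acquire()  # taken from the warm set, no spawn delay
pool.release(first)     # destroyed, and a replacement is spawned
```

Because `release` destroys rather than recycles, pre-warming does not weaken isolation: a warm container has never run a test.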