We shipped the first version of Hawkeye — a command-line tool that could run a single test against a browser sandbox — in under a week. What followed was 20 days of building everything else: the API, the job queue, the real-time streaming, the frontend, the vault, the billing, the organization management, and the production hardening.
208 commits. Five layers of infrastructure. Two people doing the bulk of the work.
How We Structured the Build
We didn't build linearly. We built in epochs, each one unlocking the next.
Epoch one was the sandbox — a Docker container with a full browser, a virtual display, and a live VNC viewer so you could watch the agent work in real time. Without that, there was nothing for an agent to interact with.
Epoch two was the agent itself — a state machine that could observe a page, reason about what to do, execute an action, and check whether the goal was complete. The state machine approach turned out to be far more valuable than a simple loop, because it gave us explicit places to add guard-rails, error recovery, and goal verification.
Epochs three and four turned the agent into a platform: a REST API, a job queue so tests could run in the background, WebSocket streaming so the frontend could show live progress, and a full multi-tenant SaaS with projects, test cases, vault secrets, schedules, and billing.
The 94-Commit Sprint
Phase 6 — what we internally called production hardening — was the most intense period. 94 commits in six days, across seven parallel tracks: database persistence, auth middleware, Stripe billing, GitHub CI integration, organization management, and a dozen smaller fixes.
This is where the real engineering happened. Not building features, but making them reliable. Connection pools that broke under Celery's process model. Real-time log events that appeared one step late due to asyncio task scheduling. Guard-rails that had to maintain conversation validity when blocking an agent's action.
Each of these bugs was invisible until we ran the system under real load with real LLM calls. That's the nature of distributed async systems — the problems don't show up in unit tests.
What We'd Do Differently
We'd invest earlier in observability. The hardest bugs to fix were the ones we couldn't see — events arriving late, connections silently failing on the second task, middleware running in the wrong order. Better internal logging from day one would have saved days of debugging.
We'd also be more aggressive about the SQLite fallback from the start. One of the best decisions we made was designing the database layer so the entire stack runs with zero infrastructure — no PostgreSQL, no containers, just a local file. That made local development fast and iteration tight. We added it late; we should have started there.
What Surprised Us
How much of the complexity was in the details, not the features. The features — visual agent, vault encryption, Stripe billing — each took hours to build. The details — flushing async tasks before the next operation, draining subprocess stderr to prevent pipe deadlocks, ordering middleware correctly so CORS preflight requests don't get rejected by auth — took days.
That ratio is normal in production systems. The lesson is to allocate time accordingly.