Three Hard Lessons from Running Async Python in Production

Building Hawkeye involved a lot of async Python: an async web framework, an async database driver, an async job queue, and an async agent that runs for minutes at a time. Most of it worked smoothly. Three things did not, and each failure taught us something we hadn't known before.

Lesson One: Connection Pools Don't Survive Event Loop Changes

The first time a background worker executed a test run, everything worked. The second time, it crashed with a cryptic error about a closed connection.

The root cause took a while to find. Our job queue uses a process model where each task creates a brand new event loop. The database connection pool was created during the first task's event loop. On the second task, the pool tried to reuse connections from the old, now-closed loop — and the database driver correctly rejected them.

The fix was conceptually simple once we understood the problem: don't pool connections across tasks. Use a pool that opens a fresh connection for every operation and closes it immediately. More overhead per query, but no cross-task state.
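A minimal sketch of the idea, not Hawkeye's actual code: `FakeConnection`, `NoPool`, and `connect` are hypothetical names, and the fake connection simply remembers which event loop created it. Two `asyncio.run()` calls stand in for two worker tasks, each with its own loop. If the stack happens to use SQLAlchemy (an assumption; the driver isn't named here), its `NullPool` class provides this same open-and-close-per-use behavior.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeConnection:
    """Hypothetical stand-in for a real driver connection."""

    def __init__(self):
        self.loop = asyncio.get_running_loop()  # loop this connection is bound to
        self.closed = False

    async def execute(self, query):
        # A real driver would fail here if used from a different loop.
        assert asyncio.get_running_loop() is self.loop
        return f"ok: {query}"

    async def close(self):
        self.closed = True


class NoPool:
    """Opens a connection per operation and closes it right after use."""

    def __init__(self, connect):
        self._connect = connect  # async factory returning a connection

    @asynccontextmanager
    async def acquire(self):
        conn = await self._connect()
        try:
            yield conn
        finally:
            await conn.close()


async def connect():
    return FakeConnection()


# Safe at module level: the pool object holds no loop-bound state.
pool = NoPool(connect)


async def task(query):
    async with pool.acquire() as conn:
        return await conn.execute(query)


# Each asyncio.run() creates and closes its own event loop,
# mimicking a worker that starts a new loop for every task.
first = asyncio.run(task("SELECT 1"))
second = asyncio.run(task("SELECT 2"))
```

The trade-off is the one already named: every query pays for a connect and a close, in exchange for never carrying a connection from one loop into the next.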

The same bug existed in the Redis client. The global connection singleton was bound to the first task's event loop. We fixed it by detecting when the current loop differs from the one the connection was created on, and recreating the connection pool when that happens.
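The loop check can be sketched roughly like this. `FakeRedisClient` and `get_client` are hypothetical names used for illustration; a real version would construct the actual Redis client where the fake one is created.

```python
import asyncio


class FakeRedisClient:
    """Hypothetical stand-in for a client whose sockets belong to one loop."""

    def __init__(self):
        self.loop = asyncio.get_running_loop()


_client = None
_client_loop = None


def get_client():
    """Return the shared client, rebuilding it if the event loop has changed."""
    global _client, _client_loop
    loop = asyncio.get_running_loop()
    if _client is None or _client_loop is not loop:
        # The previous client is tied to a loop that is no longer running,
        # so it is dropped and replaced rather than reused.
        _client = FakeRedisClient()
        _client_loop = loop
    return _client


async def task():
    # Two lookups inside one task should share the same client.
    return get_client(), get_client()


a1, a2 = asyncio.run(task())  # first task, first loop
b1, _ = asyncio.run(task())   # second task, new loop
```

Within one task the singleton behaves as before; across tasks it is rebuilt once per loop. Holding a reference to the old loop object in `_client_loop` is what makes the identity comparison reliable.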

The lesson: in a process-based task queue with asyncio, treat every task as if it started in a fresh process. Any connection created on one task's event loop will fail once a later task tries to use it on a different loop.

Lesson Two: Scheduling a Task Is Not Executing It

Our live trace streaming system showed a strange symptom: every log event appeared one step late. The "observe" event showed up just as the "reason" phase started. The "reason" event appeared when "act" began.

The agent was working correctly. The streaming was working correctly. But there was a one-step lag that made the live feed feel disconnected from reality.

The root cause was a single line of code. When emitting a real-time event, we were scheduling the network publish rather than awaiting it. In Python's async model, a scheduled task does not start until the running coroutine hands control back to the event loop, which happens at the next await that actually suspends. In our observe phase, that await was the browser action call. In the reason phase, it was the LLM invocation.

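The fix was a single zero-second sleep after each emit. This yields control to the event loop for one iteration, so every task already scheduled gets to start and run up to its first suspension point before the next operation begins. It does not guarantee that those tasks finish; a task that has to wait on I/O will still be pending afterwards. It's a pattern that appears in async Python documentation but is easy to miss in practice.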
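The lag and the fix can be reproduced in a few lines. This is a sketch with hypothetical names (`publish`, `next_phase`, `run`); the short sleep inside `next_phase` stands in for the browser action or LLM call, and the fake publisher finishes without waiting on any I/O.

```python
import asyncio


async def publish(log, event):
    # Hypothetical publisher; a real one would await a network write here.
    log.append(f"published {event}")


async def next_phase(log, name):
    log.append(f"started {name}")
    await asyncio.sleep(0.01)  # stands in for the browser action or LLM call


async def run(yield_after_emit):
    log = []
    # create_task only schedules publish(); nothing has run yet.
    task = asyncio.create_task(publish(log, "observe"))
    if yield_after_emit:
        await asyncio.sleep(0)  # give the loop one iteration to start the task
    await next_phase(log, "reason")
    await task  # keep a reference and make sure the publish finished
    return log


laggy = asyncio.run(run(yield_after_emit=False))
fixed = asyncio.run(run(yield_after_emit=True))
print(laggy)  # ['started reason', 'published observe']
print(fixed)  # ['published observe', 'started reason']
```

Without the yield, the publish only runs once the next phase reaches its own first await, which is the one-step lag. When the publish has to be fully complete before moving on, awaiting the task directly is the stronger guarantee.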

The lesson: scheduling and executing are not the same thing. If you need an async task to complete before the next operation, await it. If you only need it to get started, explicitly yield to the event loop.

Lesson Three: Always Drain Subprocess Output

On Windows, the agent would occasionally hang completely — no response from the browser control subprocess, no error message, just silence.

The cause was a 65-kilobyte pipe buffer. The browser control process was writing diagnostic messages to its error output, and nobody was reading them. When the buffer filled, the subprocess blocked on its next write attempt. Both streams are written by the same process, so once it stalled on the error-output write, the main output channel stopped producing responses too.

The fix was a background task that continuously reads the error output for the lifetime of the subprocess, discarding or logging each line as it arrives. Simple, but easy to overlook.
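A rough sketch of such a drain task using asyncio's subprocess API. The child here is a hypothetical stand-in for the browser control process: a one-line Python script that writes well over 65 KB to its error output before sending a single reply line on its main output.

```python
import asyncio
import sys

# Hypothetical child: 3000 diagnostic lines on stderr, then one reply on stdout.
CHILD = "import sys; sys.stderr.write(('diag ' * 15 + '\\n') * 3000); print('ready')"


async def drain(stream, sink):
    # Read until EOF so the pipe can never fill up and block the child.
    while True:
        line = await stream.readline()
        if not line:
            break
        sink.append(line)  # or log it, or simply discard it


async def main():
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stderr_lines = []
    # Start draining before waiting on stdout, and keep a reference to the task.
    drainer = asyncio.create_task(drain(proc.stderr, stderr_lines))
    reply = await asyncio.wait_for(proc.stdout.readline(), timeout=30)
    await proc.wait()
    await drainer
    return reply.decode().strip(), len(stderr_lines)


reply, stderr_count = asyncio.run(main())
```

If the diagnostics are never needed, passing `stderr=asyncio.subprocess.DEVNULL` when spawning the process avoids the pipe altogether. The timeout on the first read is only there so that a regression shows up as an exception rather than a silent hang.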

The lesson generalizes beyond Windows: whenever you spawn a subprocess, read every output stream you have piped, always, or redirect the ones you don't need to the null device. Unread output will eventually fill the pipe and block the process, and the failure mode, a silent hang with no error message, is one of the hardest to diagnose.

The Pattern Across All Three

All three bugs shared a structure: a resource that worked correctly in isolation, failed quietly or cryptically when the environment changed, and took significant investigation to diagnose because the visible symptom pointed away from the cause.

That's the nature of production async systems. The unit tests pass. The happy path works. The failures appear in the combination of concurrency, process model, and platform behavior that only shows up under real load. The best defense is aggressive logging and an understanding of where your runtime's assumptions can break.