OpenAI just showed receipts on what makes production agents actually fast, and it's not the LLM.
The Summary
- OpenAI published technical details on how WebSockets and connection-scoped caching cut overhead in their Codex agent, revealing the infrastructure layer that separates toy demos from production agents
- The real bottleneck in agentic workflows isn't model inference; it's the tax of HTTP handshakes, authentication, and redundant context passing
- For builders shipping agents in production, this is a blueprint: persistent connections and smart caching matter more than chasing the latest model release
The Signal
The agent hype cycle has been all vibes and no implementation details. OpenAI just changed that. Their engineering team walked through how they optimized the Codex agent loop using WebSockets in the Responses API, and the results expose where the real work happens in production agent systems.
Stateless REST calls force agents to rebuild context on every turn. Each tool call, each model query, each function result requires a new HTTP connection (TCP and TLS handshakes included), fresh authentication, and a full context payload over the wire. For a Codex agent running multiple iterations per task, that overhead compounds fast.
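To make the tax concrete, here's a minimal sketch of that stateless pattern in Python. The endpoint, request shape, and response fields are hypothetical stand-ins, not OpenAI's actual API; the point is structural: every turn opens a fresh connection and re-ships the whole conversation.

```python
import requests

# Hypothetical endpoint and payload shapes, for illustration only.
API_URL = "https://api.example.com/v1/responses"
API_KEY = "sk-placeholder"

def run_agent_loop_rest(task: str, max_turns: int = 10) -> str:
    """Stateless pattern: every turn re-sends the entire context."""
    context = [{"role": "user", "content": task}]
    answer = ""
    for _ in range(max_turns):
        # A bare requests.post opens a new connection each call, so the
        # TCP + TLS handshake, the auth check, and the ever-growing
        # `context` list all travel again on every turn.
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"input": context},
            timeout=30,
        )
        resp.raise_for_status()
        turn = resp.json()  # hypothetical shape: {"content": ..., "done": ...}
        answer = turn.get("content", "")
        context.append({"role": "assistant", "content": answer})
        if turn.get("done"):
            break
    return answer
```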
"The tax of HTTP handshakes and authentication becomes the bottleneck when agents need to think in tight loops."
WebSockets flip the model. One persistent connection. Context lives server-side for the duration of the session. The agent loop sends deltas, not full state. Connection-scoped caching means the model doesn't re-ingest the same system prompts or tool definitions on every call. For workflows where an agent might make 10-20 API calls to complete one user task, the latency savings stack.
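Here's the same loop restructured around a persistent socket, again as a sketch: the URL and message types are invented for illustration (the post doesn't publish the Responses API wire protocol), but the shape of the win is visible. One handshake, one auth, and deltas from then on.

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical URL and message shapes; the real protocol may differ.
WS_URL = "wss://api.example.com/v1/responses/ws"

def run_tool(turn: dict) -> str:
    # Stand-in for actually executing whatever tool the model asked for.
    return f"ran {turn.get('tool', 'unknown')}"

async def run_agent_loop_ws(task: str, max_turns: int = 10) -> str:
    async with websockets.connect(WS_URL) as ws:
        # Authenticate and ship the full context exactly once; the server
        # caches prompts and tool definitions for the life of the socket.
        await ws.send(json.dumps({
            "type": "session.init",
            "token": "sk-placeholder",
            "input": task,
        }))
        answer = ""
        for _ in range(max_turns):
            turn = json.loads(await ws.recv())
            answer = turn.get("content", "")
            if turn.get("done"):
                break
            # Subsequent sends are deltas: just the new tool result,
            # never the whole conversation.
            await ws.send(json.dumps({
                "type": "turn.delta",
                "tool_result": run_tool(turn),
            }))
        return answer

# asyncio.run(run_agent_loop_ws("fix the failing test"))
```

Note the asymmetry: `session.init` carries everything once, and every later send is only the new tool result.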
OpenAI called out three gains:
- Reduced per-call latency by eliminating connection overhead (you can measure this yourself; see the sketch after this list)
- Lower bandwidth usage from differential updates instead of full payloads
- Faster time-to-first-token on subsequent calls due to server-side caching
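You don't need an agent to verify the first bullet. A rough measurement, assuming only the `requests` library and any HTTPS endpoint: compare cold connections against a keep-alive session, which reuses one connection the way a persistent WebSocket does.

```python
import time

import requests

URL = "https://www.example.com/"  # any HTTPS endpoint works
N = 10

# Fresh connection per request: pays TCP + TLS handshake every time.
start = time.perf_counter()
for _ in range(N):
    requests.get(URL, timeout=10)
fresh = time.perf_counter() - start

# One pooled connection reused across requests (HTTP keep-alive).
start = time.perf_counter()
with requests.Session() as s:
    for _ in range(N):
        s.get(URL, timeout=10)
pooled = time.perf_counter() - start

print(f"fresh: {fresh:.2f}s  pooled: {pooled:.2f}s")
```

The gap between the two numbers is the handshake tax: paid on every request in the first loop, once in the second. Exact figures vary with network distance and TLS configuration.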
This isn't academic. The difference between a Codex agent that feels sluggish and one that feels responsive comes down to these infrastructure choices. Users tolerate 200ms. They abandon at 2 seconds. WebSockets and caching are what close that gap.
The Implication
If you're building agents, stop optimizing prompts before you fix your plumbing. The model is fast enough. Your infrastructure probably isn't. WebSockets aren't new tech, but their application to agentic loops is the unlock. Connection-scoped caching means your agent's working memory doesn't evaporate between thoughts.
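For a sense of what "connection-scoped" means mechanically, here's a toy server-side sketch using the Python `websockets` package, with hypothetical message types: the cache is a local variable in the connection handler, so it lives exactly as long as the socket and vanishes when it closes.

```python
import asyncio
import json

import websockets  # pip install websockets

async def handle(ws):
    cache = {}  # connection-scoped: evaporates when this socket closes
    async for raw in ws:
        msg = json.loads(raw)
        if msg.get("type") == "session.init":
            # Ingest system prompt and tool definitions exactly once.
            cache["system"] = msg.get("system", "")
            cache["tools"] = msg.get("tools", [])
            await ws.send(json.dumps({"type": "ack"}))
        else:
            # Later turns reuse the cached context instead of
            # re-parsing it from a full payload.
            reply = f"{len(cache.get('tools', []))} tools cached"
            await ws.send(json.dumps({"type": "turn", "content": reply}))

async def main():
    async with websockets.serve(handle, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

# asyncio.run(main())
```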
Watch for this pattern to proliferate. Anthropic, Google, and the open-source agent frameworks will follow. The companies that ship fast, responsive agents in 2025 won't be the ones with the best models. They'll be the ones who figured out the infrastructure layer first.