Why Faster Agent Loops Matter
OpenAI says it has reworked the plumbing behind its Responses API to make agent-style workflows substantially faster, a change aimed at reducing the time users spend waiting while tools, models, and API calls bounce back and forth during complex tasks.
In a technical post published April 22, the company described how systems such as Codex can require dozens of sequential requests to complete a single assignment: the model decides what to do next, a tool runs on the client side, the result is sent back to the API, and the cycle repeats. That pattern makes even small amounts of overhead add up quickly.
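The loop the post describes can be sketched generically. The function names below are illustrative stand-ins, not actual Responses API calls; the point is that every iteration pays a full client-to-API round trip.

```python
# Minimal sketch of a sequential agent loop, as described in the post.
# `call_model` and `run_tool` are illustrative stand-ins, not real
# Responses API calls; each turn costs one full API round trip.

def call_model(history):
    # Stand-in for an API request: the model inspects the transcript
    # and either asks for a tool or returns a final answer.
    if any(msg["role"] == "tool" for msg in history):
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "read_file", "args": {"path": "main.py"}}

def run_tool(name, args):
    # Stand-in for client-side tool execution (shell, file I/O, etc.).
    return f"<contents of {args['path']}>"

def agent_loop(task, max_turns=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):          # each turn = one full round trip
        step = call_model(history)
        if step["type"] == "final":
            return step["content"]
        result = run_tool(step["name"], step["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("turn budget exhausted")
```

With dozens of turns like these in a single assignment, any fixed cost inside the loop body is multiplied by the turn count.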
According to OpenAI, the performance problem became more visible as inference itself sped up. Earlier flagship models in the Responses API ran at about 65 tokens per second, the company said. For GPT-5.3-Codex-Spark, OpenAI targeted more than 1,000 tokens per second using Cerebras hardware. Once model generation became that fast, the slower parts of the loop were no longer easy to hide.
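The effect is easy to see with back-of-envelope arithmetic. The throughput figures come from the post; the 300 ms of per-turn overhead and the 400-token turn length are illustrative assumptions, not OpenAI's numbers.

```python
# ~65 tok/s for earlier flagship models vs a >1,000 tok/s target for
# GPT-5.3-Codex-Spark (figures from the post). Overhead of 0.3 s per
# turn and 400 tokens per turn are illustrative assumptions.

def overhead_share(tokens_per_turn, toks_per_sec, overhead_s):
    """Fraction of each turn spent on fixed overhead rather than generation."""
    gen_s = tokens_per_turn / toks_per_sec
    return overhead_s / (overhead_s + gen_s)

slow = overhead_share(400, 65, 0.3)     # ~4.6% of each turn is overhead
fast = overhead_share(400, 1000, 0.3)   # ~42.9% of each turn is overhead
```

At 65 tokens per second the fixed overhead disappears into generation time; at 1,000 tokens per second the same overhead dominates nearly half of every turn.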
From Inference Bottleneck to API Bottleneck
OpenAI breaks agent latency into three broad stages: API service work, model inference, and client-side time. The client side still matters because tools need to execute and context needs to be assembled, but the company said the API layer itself had become a meaningful bottleneck.
That shift forced a different optimization strategy. Instead of focusing only on GPU throughput, OpenAI says it began removing friction across the request path. Around November 2025, the company launched what it called a performance sprint on the Responses API. The work included caching rendered tokens and model configuration in memory, reducing extra network hops by calling inference services more directly, and speeding up parts of the safety stack so some conversations could be classified faster.
Those changes improved time to first token by nearly 45%, according to the company. But OpenAI says that was still not enough to fully expose the speed gains of its newer inference stack.
The WebSocket Shift
The larger change was architectural: replacing a series of separate synchronous API calls with a persistent connection to the Responses API using WebSockets. In practical terms, that means the client and the API can stay connected across the full agent loop rather than constantly tearing down and rebuilding request state.
OpenAI says persistent sessions allowed it to keep useful information attached to the connection itself. That reduced repeated setup work and helped the system reuse context more efficiently across turns. The result, the company said, was a roughly 40% improvement in end-to-end agent loop speed.
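A toy comparison illustrates the difference between per-request setup and a persistent session. This models the general idea of connection-scoped state, not OpenAI's actual implementation; the class names and counters are invented for illustration.

```python
# Toy model of per-request setup vs a persistent session. Illustrative
# only; not OpenAI's implementation.

class PerRequestClient:
    """Every call re-does setup: auth, config load, context rendering."""
    def __init__(self):
        self.setup_ops = 0
    def request(self, turn):
        self.setup_ops += 1          # setup repeated on every turn
        return f"response to {turn}"

class PersistentSession:
    """Setup happens once when the connection opens, then is reused
    for every turn carried over the same WebSocket-style channel."""
    def __init__(self):
        self.setup_ops = 1           # auth/config/rendering paid once
    def request(self, turn):
        return f"response to {turn}"

def run(client, turns=30):
    """Drive a 30-turn agent loop and report how often setup ran."""
    for t in range(turns):
        client.request(t)
    return client.setup_ops
```

Over a 30-turn loop the per-request client repeats its setup 30 times, while the persistent session pays it once at connection time.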
For users, the significance is straightforward. If a coding or research agent needs many tool calls to finish a job, shaving overhead from every cycle can have a bigger effect than speeding up only one stage. A workflow that once felt stalled between actions can start to feel closer to a live interaction.
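The trade-off can be made concrete with invented but plausible numbers: once inference is fast, per-turn model time is small, so shaving orchestration overhead from every cycle can outweigh making the model itself faster.

```python
# Illustrative numbers only (not from the post): a 50-turn loop with
# 100 ms of model time and 300 ms of overhead per turn.

def loop_time(turns, model_s, overhead_s):
    """Wall-clock time for a sequential agent loop."""
    return turns * (model_s + overhead_s)

baseline      = loop_time(50, 0.1, 0.3)    # 20.0 s
faster_model  = loop_time(50, 0.05, 0.3)   # 17.5 s: model twice as fast
less_overhead = loop_time(50, 0.1, 0.1)    # 10.0 s: 200 ms less overhead/turn
```

Under these assumptions, halving the model's time saves 2.5 seconds, while cutting 200 ms of overhead from each turn saves 10.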
What OpenAI Optimized
- Connection-scoped caching to avoid repeating expensive setup work.
- Fewer unnecessary network hops between API services and inference services.
- Faster safety checks in parts of the moderation and classification pipeline.
- A persistent WebSocket channel to reduce the cost of many-turn tool use.
OpenAI framed the work as a response to a broader industry change: inference is getting fast enough that surrounding systems increasingly determine perceived product quality. In that environment, a model may be able to think quickly, but the experience can still feel slow if orchestration layers lag behind.
Why This Matters Beyond Codex
Although OpenAI illustrated the problem with Codex, the implications extend to any tool-using agent. Enterprise assistants, customer-service systems, research copilots, and software agents all depend on many small interactions rather than one long model completion. Persistent sessions and lower orchestration overhead could therefore matter just as much as raw benchmark performance.
The post also offers a glimpse into a changing competitive landscape. Model vendors have spent years emphasizing better reasoning and larger context windows. Increasingly, however, they are also competing on systems engineering: throughput, responsiveness, safety latency, and how efficiently a model can stay in the loop with external tools.
OpenAI’s message is that the infrastructure around the model is now a product feature in its own right. If inference speeds continue to rise, that will likely become even more true.
The Bigger Signal
The deeper takeaway is not just that WebSockets are faster than repeated synchronous calls. It is that agent products are maturing into real-time software systems whose performance depends on coordination across APIs, caches, safety layers, and tool runtimes.
That makes this update more than an engineering footnote. It is a sign that the next gains in AI usability may come from reducing friction between model steps, not only from making each individual step smarter. As agentic systems take on longer and more complicated tasks, that distinction could determine whether they feel experimental or operational.
This article is based on reporting by OpenAI.
Originally published on openai.com