Why Faster Agent Loops Matter
OpenAI says it has reworked the plumbing behind its Responses API to make agent-style workflows substantially faster, a change aimed at reducing the time users spend waiting while tools, models, and API calls bounce back and forth during complex tasks.
In a technical post published April 22, the company described how systems such as Codex can require dozens of sequential requests to complete a single assignment: the model decides what to do next, a tool runs on the client side, the result is sent back to the API, and the cycle repeats. That pattern makes even small amounts of overhead add up quickly.
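The loop the post describes can be sketched generically. The function names below are illustrative stand-ins, not actual Responses API calls; the point is that every iteration pays a full client-to-API round trip.

```python
# Minimal sketch of a sequential agent loop, as described in the post.
# `call_model` and `run_tool` are illustrative stand-ins, not real
# Responses API calls; each turn costs one full API round trip.

def call_model(history):
    # Stand-in for an API request: the model inspects the transcript
    # and either asks for a tool or returns a final answer.
    if any(msg["role"] == "tool" for msg in history):
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "read_file", "args": {"path": "main.py"}}

def run_tool(name, args):
    # Stand-in for client-side tool execution (shell, file I/O, etc.).
    return f"<contents of {args['path']}>"

def agent_loop(task, max_turns=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):          # each turn = one full round trip
        step = call_model(history)
        if step["type"] == "final":
            return step["content"]
        result = run_tool(step["name"], step["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("turn budget exhausted")
```

With dozens of turns like these in a single assignment, any fixed cost inside the loop body is multiplied by the turn count.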
According to OpenAI, the performance problem became more visible as inference itself sped up. Earlier flagship models in the Responses API ran at about 65 tokens per second, the company said. For GPT-5.3-Codex-Spark, OpenAI targeted more than 1,000 tokens per second using Cerebras hardware. Once model generation became that fast, the slower parts of the loop were no longer easy to hide.
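The effect is easy to see with back-of-envelope arithmetic. The throughput figures come from the post; the 300 ms of per-turn overhead and the 400-token turn length are illustrative assumptions, not OpenAI's numbers.

```python
# ~65 tok/s for earlier flagship models vs a >1,000 tok/s target for
# GPT-5.3-Codex-Spark (figures from the post). Overhead of 0.3 s per
# turn and 400 tokens per turn are illustrative assumptions.

def overhead_share(tokens_per_turn, toks_per_sec, overhead_s):
    """Fraction of each turn spent on fixed overhead rather than generation."""
    gen_s = tokens_per_turn / toks_per_sec
    return overhead_s / (overhead_s + gen_s)

slow = overhead_share(400, 65, 0.3)     # ~4.6% of each turn is overhead
fast = overhead_share(400, 1000, 0.3)   # ~42.9% of each turn is overhead
```

At 65 tokens per second the fixed overhead disappears into generation time; at 1,000 tokens per second the same overhead dominates nearly half of every turn.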
From Inference Bottleneck to API Bottleneck
OpenAI breaks agent latency into three broad stages: API service work, model inference, and client-side time. The client side still matters because tools need to execute and context needs to be assembled, but the company said the API layer itself had become a meaningful bottleneck.
That shift forced a different optimization strategy. Instead of focusing only on GPU throughput, OpenAI says it began removing friction across the request path. Around November 2025, the company launched what it called a performance sprint on the Responses API. The work included caching rendered tokens and model configuration in memory, reducing extra network hops by calling inference services more directly, and speeding up parts of the safety stack so some conversations could be classified faster.
Those changes improved time to first token by nearly 45%, according to the company. But OpenAI says that was still not enough to fully expose the speed gains of its newer inference stack.
The WebSocket Shift
The larger change was architectural: replacing a series of separate synchronous API calls with a persistent connection to the Responses API using WebSockets. In practical terms, that means the client and the API can stay connected across the full agent loop rather than constantly tearing down and rebuilding request state.
OpenAI says persistent sessions allowed it to keep useful information attached to the connection itself. That reduced repeated setup work and helped the system reuse context more efficiently across turns. The result, the company said, was a roughly 40% improvement in end-to-end agent loop speed.
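A toy comparison illustrates the difference between per-request setup and a persistent session. This models the general idea of connection-scoped state, not OpenAI's actual implementation; the class names and counters are invented for illustration.

```python
# Toy model of per-request setup vs a persistent session. Illustrative
# only; not OpenAI's implementation.

class PerRequestClient:
    """Every call re-does setup: auth, config load, context rendering."""
    def __init__(self):
        self.setup_ops = 0
    def request(self, turn):
        self.setup_ops += 1          # setup repeated on every turn
        return f"response to {turn}"

class PersistentSession:
    """Setup happens once when the connection opens, then is reused
    for every turn carried over the same WebSocket-style channel."""
    def __init__(self):
        self.setup_ops = 1           # auth/config/rendering paid once
    def request(self, turn):
        return f"response to {turn}"

def run(client, turns=30):
    """Drive a 30-turn agent loop and report how often setup ran."""
    for t in range(turns):
        client.request(t)
    return client.setup_ops
```

Over a 30-turn loop the per-request client repeats its setup 30 times, while the persistent session pays it once at connection time.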
For users, the significance is straightforward. If a coding or research agent needs many tool calls to finish a job, shaving overhead from every cycle can have a bigger effect than speeding up only one stage. A workflow that once felt stalled between actions can start to feel closer to a live interaction.
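The trade-off can be made concrete with invented but plausible numbers: once inference is fast, per-turn model time is small, so shaving orchestration overhead from every cycle can outweigh making the model itself faster.

```python
# Illustrative numbers only (not from the post): a 50-turn loop with
# 100 ms of model time and 300 ms of overhead per turn.

def loop_time(turns, model_s, overhead_s):
    """Wall-clock time for a sequential agent loop."""
    return turns * (model_s + overhead_s)

baseline      = loop_time(50, 0.1, 0.3)    # 20.0 s
faster_model  = loop_time(50, 0.05, 0.3)   # 17.5 s: model twice as fast
less_overhead = loop_time(50, 0.1, 0.1)    # 10.0 s: 200 ms less overhead/turn
```

Under these assumptions, halving the model's time saves 2.5 seconds, while cutting 200 ms of overhead from each turn saves 10.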
What OpenAI Optimized
- Connection-scoped caching to avoid repeating expensive setup work.
- Fewer unnecessary network hops between API services and inference services.
- Faster safety checks in parts of the moderation and classification pipeline.
- A persistent WebSocket channel to reduce the cost of many-turn tool use.
OpenAI framed the work as a response to a broader industry change: inference is getting fast enough that surrounding systems increasingly determine perceived product quality. In that environment, a model may be able to think quickly, but the experience can still feel slow if orchestration layers lag behind.
Why This Matters Beyond Codex
Although OpenAI illustrated the problem with Codex, the implications extend to any tool-using agent. Enterprise assistants, customer-service systems, research copilots, and software agents all depend on many small interactions rather than one long model completion. Persistent sessions and lower orchestration overhead could therefore matter just as much as raw benchmark performance.
The post also offers a glimpse into a changing competitive landscape. Model vendors have spent years emphasizing better reasoning and larger context windows. Increasingly, however, they are also competing on systems engineering: throughput, responsiveness, safety latency, and how efficiently a model can stay in the loop with external tools.
OpenAI’s message is that the infrastructure around the model is now a product feature in its own right. If inference speeds continue to rise, that will likely become even more true.
The Bigger Signal
The deeper takeaway is not just that WebSockets are faster than repeated synchronous calls. It is that agent products are maturing into real-time software systems whose performance depends on coordination across APIs, caches, safety layers, and tool runtimes.
That makes this update more than an engineering footnote. It is a sign that the next gains in AI usability may come from reducing friction between model steps, not only from making each individual step smarter. As agentic systems take on longer and more complicated tasks, that distinction could determine whether they feel experimental or operational.
This article is based on reporting by OpenAI.
Originally published on openai.com