OpenAI Releases MRC Networking Protocol for Large AI Training Clusters

A new networking layer for larger AI clusters

OpenAI has introduced Multipath Reliable Connection, or MRC, a networking protocol designed for large-scale AI training systems where delays between GPUs can slow an entire job. The company said it developed the protocol with AMD, Broadcom, Intel, Microsoft, and NVIDIA, then released the specification through the Open Compute Project so other operators can adopt it.

The move targets one of the less visible bottlenecks in frontier-model development. Training runs depend on huge volumes of data moving among accelerators, and a single late transfer can leave expensive hardware waiting idle. OpenAI argues that as clusters grow, congestion, link failures, and routing problems become frequent enough that network design itself becomes a core determinant of training speed and reliability.

What MRC is meant to fix

In its description of the system, OpenAI said the protocol is built around three ideas: multi-plane high-speed networks for redundancy, adaptive packet spraying to reduce core congestion, and static source routing to work around failures. The company framed these choices as a way to remove complexity while improving resilience.

The basic problem is scale. A modern training step can require millions of data transfers across a supercomputer fabric. If a network path gets congested or a device fails, that disruption can ripple outward and stall synchronized work across many GPUs. OpenAI said MRC is intended to prevent those issues from spreading by distributing traffic more effectively and by allowing failures to be bypassed without relying on more fragile routing behavior.

Three core design choices

Multi-plane networking is intended to provide redundancy while using fewer components and less power than some alternatives.
Adaptive packet spraying spreads traffic across paths to reduce hot spots in the network core.
Static source routing is used in deployment to bypass failures and avoid classes of routing failures altogether.

OpenAI Pushes Into Custom AI Hardware With Jalapeño

OpenAI and Broadcom say they have built a custom inference chip called Jalapeño, marking OpenAI’s first move into purpose-built hardware for running large language models.

Read article

Why this matters beyond one company

OpenAI tied the release to its broader compute strategy and to the demands of Stargate-scale infrastructure. The company said shared standards at key infrastructure layers can help AI systems scale more efficiently across a wider partner ecosystem. Releasing the specification through OCP also signals that networking design for AI clusters is starting to be treated as a shared industry problem rather than a private implementation detail.

That matters because the economics of model training are shaped not only by chips and power, but by how effectively operators can keep clusters busy. A protocol that cuts jitter and makes failures easier to route around could improve utilization across large deployments, which in turn affects how quickly new models can be trained and how much infrastructure must be built to reach a given target.

The partner list also underscores how broad the problem has become. With semiconductor vendors, cloud infrastructure operators, and system builders all involved, the release suggests that AI networking is becoming its own important competitive layer. The fact that OpenAI chose an open specification rather than a purely proprietary approach implies a bet that interoperability and ecosystem adoption are now more valuable than keeping this part of the stack closed.

The bigger infrastructure signal

The announcement is notable less because of a single protocol feature than because it shows where AI infrastructure pressure is accumulating. For years, the public conversation around model scaling has focused on GPUs. MRC highlights the next-order constraint: once accelerator counts become massive, the network between them can decide whether theoretical compute actually translates into useful work.

OpenAI is effectively arguing that the path to larger and more reliable training systems runs through simpler, more failure-tolerant network fabrics. If MRC performs as described in real deployments, it could help set expectations for how future hyperscale AI clusters are built. At minimum, it marks another step in the industrialization of AI infrastructure, where advances increasingly come from systems engineering as much as from model architecture.

This article is based on reporting by OpenAI. Read the original article.

OpenAI tripled revenue to $5.7 billion in Q1 but burned through $3.7 billion to get there

OpenAI revenue surged in Q1 as losses stayed steep

OpenAI’s first-quarter figures point to rapid commercial growth, but also to a cost structure that remains extraordinarily heavy even as margins improve.

Read article

Originally published on openai.com