OpenAI, Nvidia, AMD and Microsoft unveil MRC protocol to accelerate large-scale AI training


OpenAI has released Multipath Reliable Connection (MRC), a new networking protocol developed with Nvidia, AMD, Intel, Microsoft and Broadcom, through the Open Compute Project. The protocol addresses critical bottlenecks in AI training by distributing data across multiple paths, enabling networks to support up to 131,000 GPUs with microsecond failure recovery. MRC is already deployed in OpenAI's frontier model training and Microsoft's GB200-based AI factories.

OpenAI Releases MRC Protocol Through Open Compute Project (OCP)

OpenAI has unveiled Multipath Reliable Connection (MRC), a production-proven RDMA transport protocol designed to eliminate network bottlenecks in AI training environments. Developed over two years in collaboration with Nvidia, AMD, Intel, Microsoft, and Broadcom, MRC has been released through the Open Compute Project to facilitate broader adoption across the AI industry [2]. The protocol is already operational in some of the world's most demanding AI environments, including OpenAI's training infrastructure for ChatGPT and Codex, as well as Microsoft's Fairwater supercomputers housing Nvidia GB200 Blackwell GPUs [1][2]. This isn't a lab experiment but a set of algorithms that has earned its place powering frontier large language models in production.

Source: CXOToday

Solving Critical Bottlenecks in Large-Scale AI Training

When training frontier models across tens or hundreds of thousands of GPUs, even a single data transfer arriving late can disrupt the entire process, causing expensive compute resources to sit idle during multimillion-dollar training runs [1]. The primary culprits are network congestion, link failures, and device disruptions, all of which become more common as cluster size grows [2]. MRC addresses these challenges by distributing data across multiple paths simultaneously rather than relying on a single path. Spreading traffic this way reduces congestion hotspots and limits the latency variation that can slow synchronized training operations [3]. When failures inevitably occur, MRC adapts quickly, rerouting traffic in microseconds and avoiding the delays associated with traditional network recovery mechanisms [2].
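To make the mechanics concrete, here is a minimal Python sketch of the two behaviors the sources describe: spraying one message's packets across several paths to avoid hotspots, and dropping a failed path from rotation immediately instead of stalling behind a single-path recovery timeout. This illustrates the general multipath idea only, not OpenAI's implementation; all names are invented.

```python
# Illustrative sketch of multipath spraying with fast failover.
# Not OpenAI's MRC code; class and method names are invented.

class MultipathSender:
    def __init__(self, num_paths: int):
        # Each path starts healthy. A real transport would track per-path
        # congestion and failure signals reported by the NIC and switches.
        self.healthy = {p: True for p in range(num_paths)}

    def mark_failed(self, path: int) -> None:
        # A link-down signal removes the path from rotation at once;
        # packets in flight on it are resent over surviving paths.
        self.healthy[path] = False

    def spray(self, packets: list[bytes]) -> dict[int, list[bytes]]:
        live = [p for p, ok in self.healthy.items() if ok]
        if not live:
            raise RuntimeError("no healthy paths available")
        # Round-robin spraying spreads load so no single path becomes
        # a hotspot that delays the whole synchronized collective.
        plan: dict[int, list[bytes]] = {p: [] for p in live}
        for i, pkt in enumerate(packets):
            plan[live[i % len(live)]].append(pkt)
        return plan

sender = MultipathSender(num_paths=8)
sender.mark_failed(3)  # e.g. a link flap mid-training
for path, pkts in sender.spray([f"pkt{i}".encode() for i in range(16)]).items():
    print(f"path {path}: {len(pkts)} packets")
```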

Source: SiliconANGLE

AI Networking at Scale Through Multiplane Architecture

The protocol leverages a multiplane network design that fundamentally reshapes how gigascale AI factories are constructed. Instead of treating each network interface as one 800Gb/s link, MRC splits it into multiple smaller links: one interface can connect to eight different switches, forming eight separate parallel networks, or planes, each operating at 100Gb/s instead of a single 800Gb/s network [2]. This architectural shift has profound implications for cluster topology. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s, allowing a network to fully connect approximately 131,000 GPUs with only two tiers of switches, compared to the three or four tiers required by conventional 800Gb/s networks [2]. Nvidia's Spectrum-X provides hardware-accelerated load balancing across these planes, keeping latency predictable while absorbing failures or maintenance events by shifting traffic between planes without disrupting training jobs [1].
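The headline figures follow from standard two-tier Clos arithmetic, which the short calculation below reproduces; it is a back-of-the-envelope sketch of that arithmetic, not a vendor sizing tool.

```python
# Back-of-the-envelope check of the multiplane numbers above, using
# standard two-tier Clos arithmetic (a sketch, not a vendor sizing tool).

SWITCH_CAPACITY_GBPS = 64 * 800  # 51.2 Tb/s of total switching capacity

def two_tier_endpoints(port_speed_gbps: int) -> tuple[int, int]:
    radix = SWITCH_CAPACITY_GBPS // port_speed_gbps  # ports per switch
    # Two-tier Clos: each leaf splits its ports evenly between hosts
    # (downlinks) and spines (uplinks); each spine port reaches one leaf.
    return radix, (radix // 2) * radix

for speed in (800, 100):
    radix, endpoints = two_tier_endpoints(speed)
    print(f"{speed} Gb/s ports -> radix {radix}, "
          f"two tiers connect {endpoints:,} endpoints")

# 800 Gb/s ports -> radix 64,  two tiers connect 2,048 endpoints
# 100 Gb/s ports -> radix 512, two tiers connect 131,072 endpoints
```

Splitting each interface into eight 100Gb/s planes thus turns the same switch silicon into a 512-port device, which is what collapses the fabric from three or four tiers to two at the roughly 131,000-GPU scale cited above.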

Source: Wccftech

Host-Driven Routing and Smart Tenant Control

MRC extends routing intelligence all the way to the host level, marking a major shift from classical Ethernet designs in which hosted tenants have little control over the network fabric. According to Nvidia Senior Vice President Gilad Shainer, the network interface card and host-side management stack can actively participate in routing decisions, overriding or influencing what the switches do [1]. OpenAI wanted to operate as a "smart tenant" with the ability to govern routing policy, congestion control, and failure behavior from the server edge. This capability is particularly valuable in hyperscale environments, where traditional cloud models leave the network fabric opaque to customers who only have visibility at the virtual machine or server level [1]. The protocol builds on RoCEv2 as defined by the InfiniBand Trade Association, then extends it with multipath capabilities and host-driven governance [1].
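The sources describe this capability but not MRC's actual interface, so the sketch below is hypothetical: every field and method name is invented to illustrate what "smart tenant" control from the server edge could look like, with the host choosing egress paths and owning congestion and failover policy rather than leaving those decisions to the switches.

```python
from dataclasses import dataclass

# Hypothetical host-side policy object. All names are invented for
# illustration; the MRC specification's real knobs may differ entirely.

@dataclass
class HostTransportPolicy:
    num_planes: int = 8                 # planes to spray traffic across
    congestion_control: str = "custom"  # tenant-selected CC algorithm
    failover_budget_us: int = 10        # reroute deadline before resend
    pinned_paths: tuple[int, ...] = ()  # paths the tenant prefers

    def pick_path(self, flow_id: int, path_health: list[bool]) -> int:
        # The host NIC, not only the switch, selects the egress path:
        # prefer healthy tenant-pinned paths, otherwise hash the flow
        # across whichever planes are currently up.
        candidates = [p for p in self.pinned_paths if path_health[p]] or \
                     [p for p, ok in enumerate(path_health) if ok]
        if not candidates:
            raise RuntimeError("no healthy paths available")
        return candidates[flow_id % len(candidates)]

policy = HostTransportPolicy(pinned_paths=(0, 4))
print(policy.pick_path(flow_id=42, path_health=[True] * 8))  # -> 0
```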

AMD's Programmability Advantage in GPU Networking Performance

AMD played a formative role in developing MRC, co-leading authorship of the specification and contributing advanced congestion control technology [3]. The company has implemented and deployed MRC at scale in test clusters with a leading cloud provider, validating that the design performs under sustained AI workloads [3]. Krishna Doddapaneni, CVP of Engineering at AMD, noted that "as GPUs and CPUs continue to drive compute, the real bottleneck in scaling AI is the network" [3]. AMD's programmability advantage stems from its AMD Pensando Pollara 400 AI NIC, which enabled a pre-standard implementation of an improved RoCEv2 transport protocol that evolved into today's MRC standard. That programmability gave AMD early validation flexibility, making it one of the first companies to implement MRC on a 400G NIC and enabling a seamless transition to the AMD Pensando Vulcano 800G AI NIC [3].

AI-Native Ethernet Fabrics and Industry Implications

Nvidia is careful to position MRC as "another protocol" on Spectrum-X rather than a replacement for existing approaches. The platform currently supports at least two main Ethernet transports for AI: Spectrum-X plus adaptive RDMA for general-purpose AI Ethernet with adaptive routing, and Spectrum-X with MRC, which emphasizes multipath and host-driven routing [1]. Shainer offered a pragmatic view of the relationship between MRC and the Ultra Ethernet Consortium, a multivendor effort to define new AI-native Ethernet fabrics: he expects more variety rather than a single winner, with different hyperscalers and AI providers tuning transport protocols to their own workloads and operational models [1]. The protocol will also be fundamental to OpenAI's Stargate supercomputers, built by Oracle Cloud Infrastructure in Abilene, Texas; the Stargate project aims to deploy 10GW of AI compute by 2029 and has brought over 3GW online in the past three months [2]. With MRC now available to the entire AI industry through OCP, the release paves the way for cross-industry collaboration on the hardest problems in AI infrastructure.
