OpenAI, Nvidia, AMD and Microsoft unveil MRC protocol to accelerate large-scale AI training


OpenAI has released Multipath Reliable Connection (MRC), a new networking protocol developed with Nvidia, AMD, Intel, Microsoft and Broadcom, through the Open Compute Project. The protocol addresses critical bottlenecks in AI training by distributing data across multiple paths, enabling networks to support up to 131,000 GPUs with microsecond failure recovery. MRC is already deployed in OpenAI's frontier model training and Microsoft's GB200-based AI factories.

OpenAI Releases MRC Protocol Through Open Compute Project (OCP)

OpenAI has unveiled Multipath Reliable Connection (MRC), a production-proven RDMA transport protocol designed to eliminate network bottlenecks in AI training environments. Developed over two years in collaboration with Nvidia, AMD, Intel, Microsoft, and Broadcom, MRC has been released through the Open Compute Project to facilitate broader adoption across the AI industry [2]. The protocol is already operational in some of the world's most demanding AI environments, including OpenAI's training infrastructure for ChatGPT and Codex, as well as Microsoft's Fairwater supercomputers housing Nvidia GB200 Blackwell GPUs [1][2]. This isn't a lab experiment but a set of algorithms that has earned its place powering frontier large language models in production.

Source: CXOToday

Solving Critical Bottlenecks in Large-Scale AI Training

When training frontier models across tens or hundreds of thousands of GPUs, even a single data transfer arriving late can disrupt the entire process, causing expensive compute resources to sit idle during multimillion-dollar training runs [1]. The primary culprits are network congestion, link failures, and device disruptions, all of which become more common as cluster size grows [2]. MRC addresses these challenges by distributing data across multiple paths simultaneously rather than relying on a single path. Spreading traffic this way reduces congestion hotspots and limits the latency variation that can slow synchronized training operations [3]. When failures inevitably occur, MRC adapts quickly, rerouting traffic in microseconds and avoiding the delays associated with traditional network recovery mechanisms [2].
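To make the mechanics concrete, here is a minimal Python sketch of the two behaviors the sources describe: spraying one message's packets across several paths to avoid hotspots, and dropping a failed path from rotation immediately instead of stalling behind a single-path recovery timeout. This illustrates the general multipath idea only, not OpenAI's implementation; all names are invented.

```python
# Illustrative sketch of multipath spraying with fast failover.
# Not OpenAI's MRC code; class and method names are invented.

class MultipathSender:
    def __init__(self, num_paths: int):
        # Each path starts healthy. A real transport would track per-path
        # congestion and failure signals reported by the NIC and switches.
        self.healthy = {p: True for p in range(num_paths)}

    def mark_failed(self, path: int) -> None:
        # A link-down signal removes the path from rotation at once;
        # packets in flight on it are resent over surviving paths.
        self.healthy[path] = False

    def spray(self, packets: list[bytes]) -> dict[int, list[bytes]]:
        live = [p for p, ok in self.healthy.items() if ok]
        if not live:
            raise RuntimeError("no healthy paths available")
        # Round-robin spraying spreads load so no single path becomes
        # a hotspot that delays the whole synchronized collective.
        plan: dict[int, list[bytes]] = {p: [] for p in live}
        for i, pkt in enumerate(packets):
            plan[live[i % len(live)]].append(pkt)
        return plan

sender = MultipathSender(num_paths=8)
sender.mark_failed(3)  # e.g. a link flap mid-training
for path, pkts in sender.spray([f"pkt{i}".encode() for i in range(16)]).items():
    print(f"path {path}: {len(pkts)} packets")
```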

Source: SiliconANGLE

AI Networking at Scale Through Multiplane Architecture

The protocol leverages a multiplane network design that fundamentally reshapes how gigascale AI factories are constructed. Instead of treating each network interface as one 800Gb/s link, MRC splits it into multiple smaller links: one interface can connect to eight different switches, forming eight separate parallel networks, or planes, each operating at 100Gb/s instead of a single 800Gb/s network [2]. This architectural shift has profound implications for cluster topology. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s, allowing a network to fully connect approximately 131,000 GPUs with only two tiers of switches, compared to the three or four tiers required by conventional 800Gb/s networks [2]. Nvidia's Spectrum-X provides hardware-accelerated load balancing across these planes, keeping latency predictable while absorbing failures or maintenance events by shifting traffic between planes without disrupting training jobs [1].
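The headline figures follow from standard two-tier Clos arithmetic, which the short calculation below reproduces; it is a back-of-the-envelope sketch of that arithmetic, not a vendor sizing tool.

```python
# Back-of-the-envelope check of the multiplane numbers above, using
# standard two-tier Clos arithmetic (a sketch, not a vendor sizing tool).

SWITCH_CAPACITY_GBPS = 64 * 800  # 51.2 Tb/s of total switching capacity

def two_tier_endpoints(port_speed_gbps: int) -> tuple[int, int]:
    radix = SWITCH_CAPACITY_GBPS // port_speed_gbps  # ports per switch
    # Two-tier Clos: each leaf splits its ports evenly between hosts
    # (downlinks) and spines (uplinks); each spine port reaches one leaf.
    return radix, (radix // 2) * radix

for speed in (800, 100):
    radix, endpoints = two_tier_endpoints(speed)
    print(f"{speed} Gb/s ports -> radix {radix}, "
          f"two tiers connect {endpoints:,} endpoints")

# 800 Gb/s ports -> radix 64,  two tiers connect 2,048 endpoints
# 100 Gb/s ports -> radix 512, two tiers connect 131,072 endpoints
```

Splitting each interface into eight 100Gb/s planes thus turns the same switch silicon into a 512-port device, which is what collapses the fabric from three or four tiers to two at the roughly 131,000-GPU scale cited above.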

Source: Wccftech

Host-Driven Routing and Smart Tenant Control

MRC extends routing intelligence all the way to the host level, marking a major shift from classical Ethernet designs in which hosted tenants have little control over the network fabric. According to Nvidia Senior Vice President Gilad Shainer, the network interface card and host-side management stack can actively participate in routing decisions, overriding or influencing what the switches do [1]. OpenAI wanted to operate as a "smart tenant" with the ability to govern routing policy, congestion control, and failure behavior from the server edge. This capability is particularly valuable in hyperscale environments, where traditional cloud models leave the network fabric opaque to customers who only have visibility at the virtual machine or server level [1]. The protocol builds on RoCEv2 as defined by the InfiniBand Trade Association, then extends it with multipath capabilities and host-driven governance [1].
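The sources describe this capability but not MRC's actual interface, so the sketch below is hypothetical: every field and method name is invented to illustrate what "smart tenant" control from the server edge could look like, with the host choosing egress paths and owning congestion and failover policy rather than leaving those decisions to the switches.

```python
from dataclasses import dataclass

# Hypothetical host-side policy object. All names are invented for
# illustration; the MRC specification's real knobs may differ entirely.

@dataclass
class HostTransportPolicy:
    num_planes: int = 8                 # planes to spray traffic across
    congestion_control: str = "custom"  # tenant-selected CC algorithm
    failover_budget_us: int = 10        # reroute deadline before resend
    pinned_paths: tuple[int, ...] = ()  # paths the tenant prefers

    def pick_path(self, flow_id: int, path_health: list[bool]) -> int:
        # The host NIC, not only the switch, selects the egress path:
        # prefer healthy tenant-pinned paths, otherwise hash the flow
        # across whichever planes are currently up.
        candidates = [p for p in self.pinned_paths if path_health[p]] or \
                     [p for p, ok in enumerate(path_health) if ok]
        if not candidates:
            raise RuntimeError("no healthy paths available")
        return candidates[flow_id % len(candidates)]

policy = HostTransportPolicy(pinned_paths=(0, 4))
print(policy.pick_path(flow_id=42, path_health=[True] * 8))  # -> 0
```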

AMD's Programmability Advantage in GPU Networking Performance

AMD played a formative role in developing MRC, co-leading authorship of the specification and contributing advanced congestion control technology [3]. The company has implemented and deployed MRC at scale in test clusters with a leading cloud provider, validating that the design performs under sustained AI workloads [3]. Krishna Doddapaneni, CVP of Engineering at AMD, noted that "as GPUs and CPUs continue to drive compute, the real bottleneck in scaling AI is the network" [3]. AMD's programmability advantage stems from its AMD Pensando Pollara 400 AI NIC, which enabled a pre-standard implementation of an improved RoCEv2 transport protocol that evolved into today's MRC standard. That programmability gave AMD early validation flexibility, making it one of the first companies to implement MRC on a 400G NIC and enabling a seamless transition to the AMD Pensando Vulcano 800G AI NIC [3].

AI-Native Ethernet Fabrics and Industry Implications

Nvidia is careful to position MRC as "another protocol" on Spectrum-X rather than a replacement for existing approaches. The platform currently supports at least two main Ethernet transports for AI: Spectrum-X plus adaptive RDMA for general-purpose AI Ethernet with adaptive routing, and Spectrum-X with MRC, which emphasizes multipath and host-driven routing [1]. Shainer offered a pragmatic view of the relationship between MRC and the Ultra Ethernet Consortium, a multivendor effort to define new AI-native Ethernet fabrics: he expects more variety rather than a single winner, with different hyperscalers and AI providers tuning transport protocols to their own workloads and operational models [1]. The protocol will also be fundamental to OpenAI's Stargate supercomputers, built by Oracle Cloud Infrastructure in Abilene, Texas; the Stargate project aims to deploy 10GW of AI compute by 2029 and has brought over 3GW online in the past three months [2]. With MRC now available to the entire AI industry through OCP, the release paves the way for cross-industry collaboration on the hardest problems in AI infrastructure.
