Microsoft Azure Unveils World's First NVIDIA GB300 NVL72 Supercomputer Cluster for AI Workloads

Reviewed by Nidhi Govil

Microsoft Azure has deployed the industry's first supercomputing-scale production cluster using NVIDIA's GB300 NVL72 systems, specifically designed for OpenAI's demanding AI workloads. This groundbreaking infrastructure marks a significant milestone in AI computing capabilities.

Microsoft Azure's Groundbreaking AI Infrastructure

Microsoft Azure has made a significant leap in artificial intelligence computing with the introduction of the world's first supercomputing-scale production cluster built on NVIDIA's GB300 NVL72 systems [1][2]. This cutting-edge infrastructure, known as the NDv6 GB300 VM series, is specifically designed to handle OpenAI's most demanding AI inference workloads.

Source: NVIDIA Blog

Unprecedented Scale and Performance

The new cluster comprises more than 4,600 NVIDIA Blackwell Ultra GPUs, interconnected via the NVIDIA Quantum-X800 InfiniBand networking platform [2]. Each rack in the system contains 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs, delivering 1.44 exaflops of FP4 Tensor Core performance per VM [1].

Source: NDTV Gadgets 360
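For a rough sense of scale, the figures above imply about 64 NVL72 racks. The short Python sketch below simply multiplies the numbers quoted in this article; the aggregate throughput line is a naive peak calculation, not a figure published by Microsoft or NVIDIA, and deliverable performance depends on workload and precision.

```python
# Back-of-envelope view of the cluster figures quoted in this article.
# Assumption: all constants come from the article; the real Azure topology
# and deliverable performance may differ.
GPUS_TOTAL = 4608           # Blackwell Ultra GPUs in the cluster
GPUS_PER_RACK = 72          # GB300 NVL72: 72 GPUs per rack
CPUS_PER_RACK = 36          # Grace CPUs per rack
FP4_EXAFLOPS_PER_VM = 1.44  # FP4 Tensor Core performance per rack-scale VM

racks = GPUS_TOTAL // GPUS_PER_RACK          # 64 racks
grace_cpus = racks * CPUS_PER_RACK           # 2,304 Grace CPUs
aggregate_fp4 = racks * FP4_EXAFLOPS_PER_VM  # naive peak: ~92 exaflops FP4

print(f"racks={racks}, Grace CPUs={grace_cpus}, "
      f"aggregate FP4 peak ≈ {aggregate_fp4:.1f} exaflops")
```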

Advanced Networking and Memory Capabilities

The cluster's performance rests on its networking architecture. Within each rack, the fifth-generation NVIDIA NVLink Switch fabric provides 130 TB/s of direct, all-to-all bandwidth between the 72 Blackwell Ultra GPUs [2]. This turns each rack into a unified accelerator with a shared pool of 37 terabytes of fast memory, crucial for handling massive, memory-intensive AI models [3].
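Dividing that pool evenly gives a sense of what each GPU can address. A minimal sketch, assuming the 37 TB figure covers GPU HBM plus Grace-attached memory credited evenly to the 72 GPUs (the article does not state the split):

```python
# Rough per-GPU share of the rack-level shared memory pool described above.
# Assumption: the 37 TB covers GPU HBM plus Grace-attached memory and is
# credited evenly across the 72 GPUs; the real split is not given here.
POOL_TB = 37
GPUS_PER_RACK = 72

per_gpu_gb = POOL_TB * 1000 / GPUS_PER_RACK  # decimal GB for simplicity
print(f"≈{per_gpu_gb:.0f} GB of addressable fast memory per GPU")  # ≈514 GB
```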

Scaling Beyond the Rack

To enable communication across the entire cluster, Microsoft Azure employs the NVIDIA Quantum-X800 InfiniBand platform, which provides 800 Gb/s of bandwidth per GPU and ensures seamless interaction among all 4,608 GPUs [2]. A full fat-tree, non-blocking network topology lets AI training scale efficiently across tens of thousands of GPUs while minimizing communication delays [3].
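A quick conversion of those fabric numbers, as a hedged sketch: 800 Gb/s per GPU is 100 GB/s of injection bandwidth, and multiplying by all 4,608 GPUs gives the total traffic the fabric must carry; the actual bisection bandwidth depends on the fat-tree configuration, which the article does not detail.

```python
# Naive aggregate of the scale-out fabric figures quoted above.
# Assumption: 800 Gb/s is per-GPU injection bandwidth; the fat-tree's real
# bisection bandwidth depends on switch topology details not given here.
GBITS_PER_GPU = 800
GPUS = 4608

per_gpu_GBps = GBITS_PER_GPU / 8          # 100 GB/s per GPU
total_tbps = GBITS_PER_GPU * GPUS / 1000  # ≈3,686 Tb/s injected cluster-wide
print(f"{per_gpu_GBps:.0f} GB/s per GPU, ≈{total_tbps:.0f} Tb/s aggregate injection")
```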

Impact on AI Development

This new infrastructure is set to accelerate AI model development and deployment. Microsoft claims the cluster will enable model training in weeks instead of months and support the training of models with hundreds of trillions of parameters [3][4]. The system is optimized for reasoning models, agentic AI systems, and multimodal generative AI workloads, paving the way for more advanced and responsive AI applications.
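To see why rack-to-rack scaling matters for that parameter count, here is a hedged back-of-envelope: at FP4, a hypothetical 100-trillion-parameter model needs about 50 TB just for its weights, already more than one rack's 37 TB pool, before counting activations or optimizer state.

```python
# Weight-only footprint of a hypothetical 100-trillion-parameter model at FP4.
# Assumption: 4 bits per parameter, weights only; optimizer state and
# activations during training would add several times more memory.
PARAMS = 100e12         # 100 trillion parameters (illustrative, not a real model)
BYTES_PER_PARAM = 0.5   # FP4 = 4 bits
RACK_POOL_TB = 37       # shared fast-memory pool per GB300 NVL72 rack

weights_tb = PARAMS * BYTES_PER_PARAM / 1e12  # 50 TB of weights alone
racks_needed = weights_tb / RACK_POOL_TB      # >1 rack before activations
print(f"{weights_tb:.0f} TB of FP4 weights ≈ {racks_needed:.1f} rack memory pools")
```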

Cooling and Power Innovations

To support such a high-performance system, Microsoft has implemented custom liquid cooling. The cluster uses standalone heat exchanger units alongside facility cooling to reduce water usage while maintaining thermal stability [3]. New power distribution models have also been developed to support the high energy density and dynamic load balancing these advanced GPU clusters require [4].

Future Implications

This deployment marks a significant milestone in AI infrastructure and reinforces the United States' leadership in next-generation AI technologies [5]. As Microsoft Azure scales toward hundreds of thousands of NVIDIA Blackwell Ultra GPUs across data centers worldwide, further innovations and breakthroughs in AI capabilities can be expected, particularly from customers like OpenAI [2].
