3 Sources
[1]
Nvidia may postpone volume ramp-up of Blackwell machines: TrendForce
Nvidia may have to postpone the volume ramp of next-generation AI servers based on the B200 and GB200 platforms due to overheating, power consumption, and the need to optimize interconnections, according to a TrendForce report. The market research firm believes that mass production and peak shipments of Blackwell machines will occur sometime in mid-2025, a delay of nearly half a year. Nvidia has yet to confirm or deny the claims. As expected, Nvidia and its partners can ship only limited quantities of Blackwell-based servers in 2024, as the company has to rely on its low-yielding B200 for them, although Dell is already shipping Blackwell server racks. Even though refined versions of Nvidia's B200 processors entered mass production in October and will therefore reach the company in January, TrendForce does not expect shipments of Blackwell-based servers to skyrocket immediately. According to the firm, due to overheating, power consumption, and requirements for higher-speed interconnects, mass production and peak shipments of B200 and GB200 will occur only between the second and third quarters of 2025. Just several months ago, it was reported that an Nvidia NVL72 rack based on the GB200 platform with 72 B200 GPUs would consume 120 kW of power, already significantly more than current AI server racks (typical high-density racks draw up to 20 kW, while an H100-based rack reportedly consumes around 40 kW). TrendForce now claims that Nvidia has updated the specification of the device and it consumes 140 kW, which is more than typical data centers can deliver to a single rack. The problem is that Nvidia's Blackwell GPUs were reportedly prone to overheating in servers equipped with 72 processors even when the racks consumed up to 120 kW. This issue has forced Nvidia to repeatedly revise its server rack designs, as overheating not only reduces GPU performance but also risks hardware damage.
A power consumption of 140 kW per rack means further alterations to server designs, which could result in setbacks, since higher power draw brings additional cooling requirements. Liquid cooling is essential for Blackwell servers, but modern sidecar coolant distribution units (CDUs) can only handle 60 kW to 80 kW of thermal power. To that end, cooling system providers are optimizing cold plate designs and aiming to double or triple the capacity of CDUs. TrendForce expects the capacity of liquid-to-liquid in-row CDUs to exceed 1.3 MW, with further advancements possible, so excessive heat dissipation should eventually cease to be a major problem. However, according to the report, power consumption and heat management are not the only issues that Nvidia and its partners have to solve. TrendForce claims that Nvidia has to optimize its interconnections but does not elaborate on which interconnections must be optimized. It remains to be seen how the claimed teething problems with Nvidia's B200 and GB200 servers will affect the launch timeframe and availability of the B200A, based on simplified Blackwell processors, and the B300 and GB300 machines featuring refreshed Blackwell GPUs. While the B200A will likely consume considerably less power than the B200/GB200, the refreshed B300-series Blackwell GPUs promise more memory and higher compute performance, which usually comes at higher power, so these products will likely consume even more than 140 kW per rack, necessitating even more sophisticated components and cooling.
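As a rough illustration of the cooling gap described above, the figures from the report can be plugged into a short calculation (the power and capacity numbers are the article's; the script itself is just an illustrative sketch):

```python
# Rack power and CDU capacity figures as reported by TrendForce.
RACK_POWER_KW = 140          # GB200 NVL72 rack, updated specification
SIDECAR_CDU_KW = (60, 80)    # current sidecar CDU thermal capacity range
IN_ROW_CDU_KW = 1300         # projected liquid-to-liquid in-row CDU (1.3 MW)

# How far short a single sidecar CDU falls of one rack's heat load.
shortfall_low = RACK_POWER_KW / SIDECAR_CDU_KW[1]   # best case: 140/80 = 1.75x
shortfall_high = RACK_POWER_KW / SIDECAR_CDU_KW[0]  # worst case: 140/60 ≈ 2.33x
print(f"Sidecar CDU capacity must grow {shortfall_low:.2f}x-{shortfall_high:.2f}x")

# A 1.3 MW in-row CDU could, in principle, absorb the heat of several full racks.
racks_per_in_row_cdu = IN_ROW_CDU_KW // RACK_POWER_KW
print(f"One 1.3 MW in-row CDU covers up to {racks_per_in_row_cdu} racks")
```

The result is consistent with the report's claim that CDU capacity must roughly double or triple before a single unit can absorb a 140 kW rack.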
[2]
NVIDIA GB200 AI server mass production, peak shipments could be delayed until Q2 or Q3 2025
NVIDIA's new GB200 rack-mounted AI servers are still experiencing issues, with a new TrendForce report suggesting that the supply chain requires more time for optimization and adjustment, which could push mass production and peak shipments to Q2 or even Q3 2025. According to TrendForce, the supply chain needs more time for GB200 rack servers mostly because of the rack's higher design specifications, including its requirement for high-speed interconnect interfaces and a thermal design power (TDP) that "significantly exceeds market norms". TrendForce is now projecting that mass production and peak shipments of NVIDIA GB200 rack servers are "unlikely to occur until between Q2 and Q3 of 2025". The NVIDIA GB rack series includes the GB200 and new GB300 models, which feature even more complex technology and higher production costs. NVIDIA's new GB200 NVL72 AI server is expected to become "the most widely adopted model in 2025", potentially accounting for up to 80% of total deployments as NVIDIA ramps up its push into the market with GB200. The high-speed interconnect issue stems from NVIDIA's in-house NVLink connectivity (the high-speed connection between GPUs): GB200 uses fifth-generation NVLink, which offers significantly higher total bandwidth than the current industry standard of PCIe 5.0. TrendForce notes that the TDP of the 2024-dominant HGX AI servers typically ranges from 60 kW to 80 kW per rack, but the new GB200 NVL72 AI server can reach 140 kW per rack, close to double the power demands of current racks. CSPs (cloud service providers) are therefore pushing the adoption of liquid-cooling solutions over air cooling, because air cooling is no longer sufficient for the higher thermal loads.
[3]
NVIDIA's GB200 rack needs more supply chain optimization, mass production expected in Q2 and Q3 of 2025 By Investing.com
Investing.com -- The NVIDIA (NASDAQ:NVDA) GB200 rack-mounted solution requires further optimization and adjustment in its supply chain, according to recent research by TrendForce. The complex design specifications of the GB200 rack, including high-speed interconnect interfaces and thermal design power (TDP) requirements that exceed market norms, are the primary reasons for this need. As a result, TrendForce predicts that mass production and peak shipments will likely take place between Q2 and Q3 of 2025. The NVIDIA GB rack series, which includes the GB200 and GB300 models, is characterized by complex technology and higher production costs. This makes it a preferred solution for large Cloud Service Providers (CSPs) and other potential users such as Tier-2 data centers, national sovereign cloud providers, and academic research institutions working on High-Performance Computing (HPC) and Artificial Intelligence (AI) applications. The GB200 NVL72 model is expected to be the most popular in 2025, possibly accounting for up to 80% of total deployments as NVIDIA increases its market efforts. NVIDIA's proprietary NVLink technology is integral to the company's strategy to enhance the computational performance of AI and HPC server systems. This technology allows for high-speed connections between GPU chips. The GB200 uses the fifth-generation NVLink, providing a total bandwidth that significantly surpasses the current industry standard, PCIe 5.0. The TDP of the HGX AI server, which dominated in 2024, typically ranges from 60 kW to 80 kW per rack. However, the GB200 NVL72's TDP reaches 140 kW per rack, doubling power requirements. This has led manufacturers to speed up the adoption of liquid cooling solutions, as traditional air cooling methods cannot handle such high thermal loads. The advanced design requirements for the GB200 have raised concerns about possible delays in component availability and system shipments. 
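The article states that fifth-generation NVLink "significantly surpasses" PCIe 5.0 but does not quantify the gap. Using NVIDIA's publicly stated headline figure of 1.8 TB/s of total bandwidth per GPU for fifth-generation NVLink, and roughly 128 GB/s of bidirectional bandwidth for a PCIe 5.0 x16 link (both figures are taken from vendor specifications, not from this report), the difference can be sketched as:

```python
# Headline bandwidth figures (vendor-published assumptions, not from the article):
# fifth-generation NVLink per Blackwell GPU, and an approximate PCIe 5.0 x16 link.
NVLINK5_GBPS = 1800      # GB/s per GPU (NVIDIA's stated 1.8 TB/s total)
PCIE5_X16_GBPS = 128     # GB/s bidirectional (~64 GB/s each direction)

ratio = NVLINK5_GBPS / PCIE5_X16_GBPS
print(f"NVLink 5 offers roughly {ratio:.0f}x the bandwidth of a PCIe 5.0 x16 link")
```

Even as a back-of-the-envelope estimate, an order-of-magnitude gap of this kind explains why NVLink, rather than PCIe, carries GPU-to-GPU traffic inside the NVL72 rack.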
TrendForce states that the production of Blackwell GPU chips is progressing mostly as planned, with only limited shipments expected in 4Q24. Production volume is expected to increase gradually from 1Q25 onwards. However, due to ongoing supply chain adjustments for the AI server system components, shipments at the end of 2024 are expected to be lower than industry expectations. As a result, TrendForce predicts that the peak shipment period for the GB200 full-rack system will be delayed to between Q2 and Q3 of 2025. The GB200 NVL72's TDP of 140 kW has made liquid cooling essential, as it surpasses the capabilities of traditional air-cooled solutions. The adoption of liquid-cooling components is gaining momentum, with leading industry players investing heavily in research and development for liquid cooling technologies. Notably, suppliers of coolant distribution units are striving to improve cooling efficiency by increasing rack sizes and developing more efficient cold plate designs. Current sidecar CDUs can dissipate between 60 kW and 80 kW, but future designs are expected to double or even triple this cooling capacity. The development of liquid-to-liquid in-row CDU systems has allowed cooling performance to exceed 1.3 MW, with further improvements expected as computational power demands continue to grow.
Nvidia's next-generation Blackwell AI servers, including the GB200 and GB300 models, may experience delays in mass production and peak shipments until mid-2025 due to overheating, power consumption, and interconnection optimization issues.
Nvidia, the leading AI chip manufacturer, may be facing significant challenges with its next-generation Blackwell AI servers. According to a report by TrendForce, the mass production and peak shipments of Blackwell machines, including the B200 and GB200 platforms, could be postponed until mid-2025, representing a delay of nearly six months [1][2].
The primary issues causing the potential delay are:
Overheating: The Blackwell GPUs are reportedly prone to overheating in servers equipped with 72 processors, even at high power consumption levels [1].
Power Consumption: The power requirements for Blackwell-based servers have increased significantly. An Nvidia NVL72 rack based on the GB200 platform with 72 B200 GPUs is now expected to consume 140 kW of power, up from the previously reported 120 kW [1][2].
Interconnection Optimization: TrendForce claims that Nvidia needs to optimize its interconnections, particularly the high-speed NVLink technology used for GPU-to-GPU communication [3].
The extreme power consumption of Blackwell servers necessitates advanced cooling solutions, with liquid cooling replacing air cooling and CDU suppliers working to double or triple unit capacity.
The potential delay could have significant implications for the AI hardware market, with peak shipments of GB200 racks now expected between Q2 and Q3 of 2025.
The AI industry is adapting to the challenges posed by these high-performance servers, as cooling vendors invest heavily in higher-capacity liquid-cooling designs.
As Nvidia works to overcome these technical hurdles, the AI hardware landscape continues to evolve, with power efficiency and thermal management becoming increasingly critical factors in the development of next-generation AI infrastructure.
French tech giant Capgemini agrees to acquire US-listed WNS Holdings for $3.3 billion, aiming to strengthen its position in AI-powered intelligent operations and expand its presence in the US market.
10 Sources
Business and Economy
7 hrs ago
Isomorphic Labs, a subsidiary of Alphabet, is preparing to begin human trials for drugs developed using artificial intelligence, potentially revolutionizing the pharmaceutical industry.
3 Sources
Science and Research
15 hrs ago
BRICS leaders are set to call for protections against unauthorized AI use, addressing concerns over data collection and fair payment mechanisms during their summit in Rio de Janeiro.
3 Sources
Policy and Regulation
23 hrs ago
Huawei's AI research division, Noah's Ark Lab, denies allegations that its Pangu Pro large language model copied elements from Alibaba's Qwen model, asserting independent development and adherence to open-source practices.
3 Sources
Technology
7 hrs ago
Samsung Electronics is forecasted to report a significant drop in Q2 operating profit due to delays in supplying advanced memory chips to AI leader Nvidia, highlighting the company's struggles in the competitive AI chip market.
2 Sources
Business and Economy
15 hrs ago