Curated by THEOUTPOST
On Thu, 21 Nov, 12:07 AM UTC
2 Sources
[1]
Nvidia May Have Already Fixed Blackwell's Cooling Issues
Earlier this week, a report from The Information said Nvidia's Blackwell AI chips were delayed due to an overheating issue that developed when they were placed on server racks, but a third-party research firm claims the problem is overblown and was fixed months ago. Blackwell, which is designed for businesses looking to build out their AI data centers, is deployed in server racks that can fit up to 72 GPUs. Semianalysis, a research firm that focuses on the semiconductor and AI industries, tells Business Insider that Nvidia suppliers reworked the server racks with "minor" changes to address the problem. According to the firm's chief analyst, cooling may remain a concern in the future, but the specific server issue in question has been fixed.
Nvidia said earlier this week that it is "working with leading cloud service providers as an integral part of our engineering team and process" regarding any potential issues and that "engineering iterations are normal and expected."
Overheating GPUs can throttle performance and cause operational issues. A GPU's immediate surroundings (such as the number of nearby fans, the type of case, or the rack design) can directly affect its temperature, and the GPU's own design can also result in higher average temperatures depending on the specific model.
But Georgia Tech Professor Bara Cola -- who is also the founder of Carbice, which develops thermal computing solutions -- argues that heat itself isn't Blackwell's biggest challenge. "The real challenge is mechanical stress and not heat. I am confident that Nvidia will find a way to operate these chips for their customers. High-performance chips like this will always run hot, and it is just a matter of balancing how hot -- smart engineers will solve this," Cola tells PCMag via email. "But early failure happens when the interfaces cannot handle the thermal expansion stress that the heat brings. This is a hard materials science problem."
Blackwell previously had a "design flaw" unrelated to the server overheating issue. Nvidia CEO Jensen Huang has said that earlier problem has since been resolved.
[2]
Nvidia's Blackwell AI GPU overheating issues are seemingly overhyped -- semiconductor analysts reveal cooling issues have been mostly addressed
Blackwell's cooling issues aren't as severe as some might have thought. Reports of Nvidia's GB200 NVL72 server racks overheating appear to have been exaggerated, and Business Insider reports that Blackwell's cooling design faults have already been addressed. Dylan Patel, chief analyst at Semianalysis, reportedly told Business Insider that Blackwell's design issues, which have been present for months, have largely been resolved and that the overheating concerns are overblown. Semianalysis' five analysts monitoring the semiconductor industry reported that the cooling issues prompting "reworks" from several suppliers required only "minor" changes.
The cooling faults have been specifically problematic for Nvidia's massive 72-chip server rack, which can consume up to 120kW. Flaws in the rack's design forced Nvidia to reevaluate it multiple times because the GPUs inside were overheating. This set back shipments of Nvidia's GB200 hardware, causing additional delays while the required design changes were made.
Nvidia's B200 GPUs are among the most powerful processors available for AI workloads. The GB200 superchip, for instance, has a configurable TDP in the thousands of watts, with a peak rating of up to 2,700 watts. These absurdly high power figures make air cooling virtually impossible within the constraints of a standard rack-mount form factor. This physics problem has forced Nvidia to require liquid cooling for its latest Blackwell GPUs. It also requires data centers to revamp their server farms to accommodate the infrastructure needed to support liquid-cooled servers.
Nvidia could sidestep this problem by making slower air-cooled GPUs -- which the company still does, in the form of products such as the H200 NVL. However, to remain at the bleeding edge of the AI GPU arms race, Nvidia is prioritizing performance no matter the cost, which is why it has opted to build GPUs that draw thousands of watts at the expense of air cooling.
The good news is that the 72-chip Blackwell cooling issues are apparently minor and have largely been addressed already. In addition, only Nvidia's flagship 72-chip server rack was affected.
Reports of overheating problems with Nvidia's Blackwell AI chips have been exaggerated, according to industry analysts. The company has reportedly addressed the cooling issues in its high-performance server racks.
Nvidia, the leading AI chip manufacturer, has reportedly resolved cooling issues with its latest Blackwell AI GPUs, contrary to recent reports suggesting significant delays due to overheating problems. Industry analysts claim that the concerns have been largely exaggerated and that Nvidia has already implemented solutions to address these challenges 1,2.
Earlier reports from The Information suggested that Nvidia's Blackwell AI chips were facing delays due to overheating issues when placed on server racks. The GB200 NVL72 server racks, capable of housing up to 72 GPUs, were said to be particularly affected by these thermal challenges 1.
Dylan Patel, chief analyst at Semianalysis, a research firm focusing on the semiconductor and AI industries, told Business Insider that Nvidia suppliers have already reworked the server racks with "minor" changes to address the problem. According to Patel, while cooling may be a concern in the future, the specific server issue in question has been largely resolved 1,2.
Nvidia has acknowledged that it is "working with leading cloud service providers as an integral part of our engineering team and process" to address any potential issues. The company emphasized that "engineering iterations are normal and expected" in the development of cutting-edge technology 1.
The Blackwell GPUs, designed for businesses building AI data centers, are incredibly powerful and consume massive amounts of energy. The GB200 superchip, for instance, has a configurable TDP of up to 2,700 watts, making air cooling virtually impossible within standard rack mount form factors 2.
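To see why air cooling becomes impractical at this power density, a rough heat-balance estimate is enough. The Python sketch below is a back-of-the-envelope calculation, not anything from Nvidia's documentation: it takes the roughly 120kW per-rack figure cited for the GB200 NVL72, assumes a typical 20 K coolant temperature rise, and uses standard properties of air and water to compare the flow rates each coolant would need.

# Back-of-the-envelope cooling estimate for a ~120 kW rack. Illustrative only:
# the 120 kW and 2,700 W figures come from the article; the material constants
# are textbook values and the 20 K temperature rise is an assumed typical number.

AIR_DENSITY = 1.2           # kg/m^3, air near room temperature
AIR_SPECIFIC_HEAT = 1005    # J/(kg*K)
WATER_SPECIFIC_HEAT = 4186  # J/(kg*K)

def airflow_m3_per_s(heat_watts, delta_t_kelvin):
    """Volumetric airflow needed to remove heat_watts at a given inlet-to-outlet temperature rise."""
    mass_flow = heat_watts / (AIR_SPECIFIC_HEAT * delta_t_kelvin)    # kg/s
    return mass_flow / AIR_DENSITY

def water_flow_l_per_min(heat_watts, delta_t_kelvin):
    """Water flow in liters per minute needed to absorb the same heat load."""
    mass_flow = heat_watts / (WATER_SPECIFIC_HEAT * delta_t_kelvin)  # kg/s, roughly 1 L/s per kg/s for water
    return mass_flow * 60

rack_power = 120_000  # W, the GB200 NVL72 rack figure cited in the article
delta_t = 20          # K, assumed allowable temperature rise of the coolant

air = airflow_m3_per_s(rack_power, delta_t)
print(f"Air:   {air:.1f} m^3/s (about {air * 2119:.0f} CFM) through one rack")
print(f"Water: {water_flow_l_per_min(rack_power, delta_t):.0f} L/min through one rack")

Under these assumptions, a single rack would need roughly 5 m^3/s of air (over 10,000 CFM), far beyond what typical rack fans can deliver, while the same 120 kW can be carried away by well under 100 liters of water per minute. That is the basic physics behind the shift to liquid cooling described here.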
To manage the extreme heat generated by these high-performance chips, Nvidia has opted for liquid cooling solutions. This decision requires data centers to revamp their infrastructure to accommodate liquid-cooled servers, highlighting the company's commitment to prioritizing performance over conventional cooling methods 2.
Professor Bara Cola from Georgia Tech, who is also the founder of Carbice, a thermal computing solutions company, offers a different perspective on the challenges facing Blackwell:
"The real challenge is mechanical stress and not heat. I am confident that Nvidia will find a way to operate these chips for their customers. High-performance chips like this will always run hot, and it is just a matter of balancing how hot -- smart engineers will solve this," Cola explained 1.
It's worth noting that Blackwell had previously faced a "design flaw" unrelated to the server overheating issue. Nvidia CEO Jensen Huang has stated that this earlier problem has since been resolved, demonstrating the company's ability to address and overcome technical challenges 1.
As the AI GPU arms race continues, Nvidia's approach of prioritizing performance at the expense of conventional cooling methods underscores the company's commitment to maintaining its position at the forefront of AI technology. While challenges remain, the reported resolution of the cooling issues suggests that Nvidia is well-equipped to handle the complex engineering demands of next-generation AI hardware.
NVIDIA's latest GB200 AI servers are at the center of controversy, with reports of overheating issues and order reductions from major tech companies. Taiwanese suppliers deny these claims, while the industry grapples with the transition to liquid cooling technology.
6 Sources
Nvidia's next-generation Blackwell AI servers, including the GB200 and GB300 models, may experience delays in mass production and peak shipments until mid-2025 due to overheating, power consumption, and interconnection optimization issues.
3 Sources
NVIDIA CEO Jensen Huang admits to a design flaw in the company's latest Blackwell AI chips, which caused production delays. The issue has been resolved with TSMC's assistance, and mass production is back on schedule.
9 Sources
NVIDIA prepares to launch its next-generation Blackwell GB200 AI servers in December, with major cloud providers like Microsoft among the first recipients. This move aims to address supply issues and meet the growing demand for AI computing power.
3 Sources
Nvidia's highly anticipated Blackwell AI GPUs may be delayed, according to industry sources. The setback could impact the AI chip market and Nvidia's dominance in the sector.
2 Sources