Curated by THEOUTPOST
On Thu, 24 Oct, 12:06 AM UTC
9 Sources
[1]
Nvidia fixes Blackwell chip flaw with help from TSMC, mass production back on schedule
What just happened? Nvidia has successfully fixed a design flaw in its latest Blackwell AI chips, according to CEO Jensen Huang. The issue, which caused production delays, has been solved with the assistance of TSMC, Nvidia's long-standing manufacturing partner. In fact, it was TSMC that originally spotted the problem. Overcoming this issue was crucial for Nvidia, as it aims to maintain its dominant position in the AI chip market. As demand for high-performance AI computing solutions continues to surge, the successful launch of Blackwell will play a pivotal role in providing the necessary hardware. Huang candidly admitted the company's responsibility for the setback. "We had a design flaw in Blackwell," he said. "It was functional, but the design flaw caused the yield to be low. It was 100 percent Nvidia's fault." The Blackwell chips, unveiled in March, were originally slated for second-quarter shipping. However, the design flaw led to delays, potentially affecting major customers such as Meta, Google, and Microsoft. The Blackwell project was unusually complex, Huang said, which may have been a factor in the flaw. "In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time." The technical issue stemmed from the intricate packaging technology used in the Blackwell B100 and B200 GPUs. These chips employ TSMC's CoWoS-L packaging, which utilizes an RDL interposer with local silicon interconnect bridges to achieve data transfer rates of about 10 TB/s. The problem arose from a mismatch in thermal expansion properties between various components, causing system warping and failure. To address this, Nvidia modified the top metal layers and bumps of the GPU silicon, enhancing production yields.
While specific details of the fix remain undisclosed, the company confirmed that new masks were required. The speed of the resolution is noteworthy. Typically, addressing such issues in the semiconductor industry involves modifying metal layers and creating new steppings, a process that can take around three months. "What TSMC did was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace," Huang said. With the design flaw now resolved, mass production of the fixed Blackwell GPUs is set to begin in late October. Shipments are expected to start in early 2025, aligning with Nvidia's fiscal year. Despite the setback, demand for Blackwell chips remains high. Huang had previously described the demand as "insane," with customers eager to be first in line for the new technology. Google has ordered over 400,000 GB200 chips in a deal exceeding $10 billion. Similarly, Meta has placed a $10 billion order, while Microsoft is set to receive 55,000 to 65,000 GB200 GPUs ready for OpenAI by the first quarter of 2025.
[2]
Nvidia's Jensen Huang admits AI chip design flaw was '100% Nvidia's fault' -- TSMC not to blame, now-fixed Blackwell chips are in production
Nvidia's yield-killing design flaw in its Blackwell GPU was fixed months ago, and a refined version of the B100/B200 processors is about to enter mass production. Jensen Huang, Nvidia's CEO, admitted this week that the flaw was entirely caused by Nvidia and said that the company's production partner TSMC helped fix it in a timely manner, according to Reuters. "We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low," Huang said. "It was 100% Nvidia's fault." When the first reports about the design flaw emerged, some media outlets assumed TSMC was to blame -- and suggested this might be causing strain between Nvidia and its foundry partner. This was not the case, according to Huang, and the problem was caused by Nvidia's own miscalculations. Huang also dismissed reports of tensions between the two companies as "fake news." Nvidia's Blackwell B100 and B200 GPUs link their two chiplets using TSMC's CoWoS-L packaging technology, which relies on an RDL interposer equipped with local silicon interconnect (LSI) bridges (to enable data transfer rates of about 10 TB/s). The placement of these bridges is critical. However, a supposed mismatch in the thermal expansion properties between the GPU chiplets, LSI bridges, RDL interposer, and motherboard substrate caused the system to warp and fail, and Nvidia reportedly had to modify the top metal layers and bumps of the GPU silicon to enhance production yields. While the company did not disclose specific details about the fix, it did mention that new masks were required. Yield-killing problems and major functionality issues (erratas) are not unheard of in the semiconductor world. Typically, companies fix them by modifying a metal layer (or two) and calling it a new stepping. Case in point: Intel's Sapphire Rapids reportedly had 500 bugs, and the company released around a dozen steppings to fix them all. 
Every new stepping takes around three months to complete (including identifying the problem, fixing it, and producing a new version of the chip), so the speed at which Nvidia and TSMC fixed the Blackwell GPU is pretty impressive. The now-fixed Blackwell GPUs for AI and supercomputers will enter mass production in late October, and should start shipping early next year (which will still be Nvidia's fiscal year 2025). That said, Nvidia disclosed earlier this year that, in order to meet demand for its Blackwell GPUs among major cloud service providers such as AWS, Google, and Microsoft, it will still have to ship some of the initial low-yield Blackwell processors in 2024. It's unclear how many Blackwell GPUs will be shipped to datacenters in 2024.
[3]
"100% Nvidia's fault": Jensen Huang admits Blackwell AI chips had a concerning design flaw
Nvidia CEO Jensen Huang has confirmed that a design flaw in its top-end Blackwell AI chips, which had affected production, was an entirely internal problem and has now been fixed. "We had a design flaw in Blackwell," Huang said at an event in Copenhagen, Reuters reported. "It was functional, but the design flaw caused the yield to be low. It was 100% Nvidia's fault." First identified in August 2024, the delay to Blackwell B100/B200 processors had raised eyebrows around the world, but Huang made clear that the issue was of Nvidia's own making. Blackwell chips have been in high demand since Nvidia unveiled the platform earlier in 2024, with Huang describing it as "the world's most powerful chip," offering previously unheard-of levels of AI computing power. Set to begin shipping in the latter part of 2024, Blackwell binds together two GPU dies, which are connected by a 10 TB/second chip-to-chip link into a single, unified GPU. This uses TSMC's CoWoS-L packaging technology, which relies on an RDL interposer equipped with local silicon interconnect (LSI) bridges that need to be positioned precisely to allow fast data transfer - the misalignment of which resulted in the issue. Initial media reports had claimed the issue had caused friction with manufacturing partner TSMC, but Huang dismissed the claims as "fake news". "In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time," he said. "What TSMC did, was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace." Blackwell is set to be up to 30x faster than its Grace Hopper predecessor when it comes to AI inference tasks, whilst also reducing cost and energy consumption by up to 25x.
[4]
NVIDIA CEO: 'we had a design flaw in Blackwell, it was 100% NVIDIA's fault' not TSMC's fault
NVIDIA CEO Jensen Huang has addressed the issues surrounding its latest Blackwell AI chip, admitting that there was a design flaw that was "100% NVIDIA's fault" and that TSMC helped them through the tough Blackwell AI GPU launch. NVIDIA initially unveiled its new Blackwell chips at GTC 2024 in March. They were expected to ship in Q2 2024 but were delayed, which could have affected big-paying customers like Meta, Microsoft, and Google. Huang said: "We had a design flaw in Blackwell. It was functional, but the design flaw caused the yield to be low. It was 100% NVIDIA's fault. In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time". He added: "What TSMC did, was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace". Jensen's comments come hot on the heels of rumors that NVIDIA's issues with its Blackwell AI chips hurt relations with TSMC, and that the GeForce RTX 50 series "Blackwell" gaming GPUs could be fabbed by Samsung. Obviously, that is insane, and now the CEO of NVIDIA has come out and cleared the air... and cleared TSMC of any blame for the design flaws in Blackwell.
[5]
NVIDIA CEO Says Blackwell "Design Flaws" Were 100% On The Company, TSMC Had No Part To Play & Now Back To Full Production
NVIDIA says that Blackwell's design flaws are 100% its own fault, adding that TSMC had no part in them and that the Taiwan giant helped sort the issue out.
NVIDIA Verifies That The Firm Indeed Witnessed Blackwell Design Flaws; However, The Issue Was Rectified At An "Incredible Pace"
Team Green's Blackwell AI portfolio is one of the most in-demand products in the industry, mainly due to the performance and capabilities it brings. However, a few weeks prior to launch, it was rumored that the architecture had fallen victim to design flaws, with the packaging technology as the culprit. Because the problem was associated with TSMC's CoWoS, this created the perception that the Taiwan giant was behind NVIDIA's Blackwell flaws. However, in a report by Reuters, NVIDIA's CEO Jensen Huang has confirmed that Blackwell did indeed suffer from design flaws but that TSMC had no part to play in them; instead, it was "100% NVIDIA's fault". Here is what he had to say: "We had a design flaw in Blackwell. It was functional, but the design flaw caused the yield to be low. It was 100% Nvidia's fault. In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production simultaneously. What TSMC did, was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace." - NVIDIA's CEO via Reuters
Well, it looks like NVIDIA has fixed Blackwell production, and the blame is off TSMC's shoulders, given that Jensen himself has cleared the Taiwan giant. The firm has sampled multiple chips to make a Blackwell product work, which shows that the company did face troubled yield rates; these could have been much more devastating for NVIDIA's business, but fortunately, Blackwell has been rescued.
Given that initial Blackwell products are now in the shipping phase, moving into Q4 2024, it will be interesting to see how the architecture turns out for the industry. NVIDIA has said that Blackwell is slated to be the most "successful" product in the company's history. The next phase of NVIDIA's AI hype will surely be interesting, potentially surpassing the hype created by the Hopper generation.
[6]
Cutting through the 'fake news', Nvidia CEO Jen-Hsun Huang says Blackwell's design flaw was '100% Nvidia's fault' and there are no tensions with TSMC
After delays, Nvidia's Blackwell chips are finally shipping to customers, but the roll-out hasn't been completely smooth, as there were recent reports that Nvidia's and TSMC's relationship might be showing signs of stress over some Blackwell chip failures. Now, Nvidia CEO Jen-Hsun Huang has clarified that there aren't any tensions between the two companies and the problem was "100% Nvidia's fault". As reported by Reuters, Huang admits a Blackwell design flaw but lays blame entirely at its own feet. He says "a design flaw with its latest Blackwell AI chips which impacted production has been fixed with the help of longtime Taiwanese manufacturing partner TSMC." According to Huang, Blackwell was "functional, but the design flaw caused the yield to be low. It was 100% Nvidia's fault." And in fact, far from hindering Blackwell's roll-out, "what TSMC did, was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace." So much for the blame game supposedly occurring between the AI and chip giants, then -- reports of tensions between the two companies Huang reportedly called "fake news". And so much for Nvidia eyeing up Samsung for its chip production -- a claim we were rightly very sceptical about. Whatever the original sources were for the initial reports, they appear to have been wrong. All seems rosy in the land of Team Green -- well, besides the still somewhat slowed Blackwell release. Reuters also reports that Nvidia's shares "fell around 2% in early trading", perhaps hinting that this was related to the "fake news". Big picture-wise, though, Nvidia's going from strength to strength and a 2% drop in early trading will matter little. The company's still the hottest thing in the burgeoning AI market, and if the non-fatal Blackwell flaw is now completely fixed as Huang says, there's little to worry about. 
Which is especially beneficial given Blackwell chips have already started shipping and are already lining some server racks. Yes, there's been a little delay, but now it seems like Blackwell -- which Nvidia thinks could be the "most successful product" in its history -- is ready to churn through the next wave of cloud and AI server workloads. Of course, as PC gamers, we're also excited for what Blackwell might bring in the form of RTX 50-series graphics cards. These are sure to be some of the best graphics cards for gaming upon release, and we're expecting them to launch in early 2025. Combine this with general signs of a booming GPU market, and there might be a lot for us to look forward to.
[7]
'100% Nvidia's Fault': CEO Jensen Huang Says the Company's AI Chip With 'Insane' Demand Had a Crucial Design Flaw
Nvidia's biggest customers are Amazon, Google, Meta, and Microsoft. Nvidia's Blackwell AI chip, the same one that Nvidia CEO Jensen Huang said had "insane" demand, is now free of a design error that caused a production delay. According to a Wednesday Reuters report, Huang said that the design mistake "was 100% Nvidia's fault." "We had a design flaw in Blackwell," he stated. "It was functional, but the design flaw caused the yield to be low." He specified the nature of the problem, stating that "in order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time." After fixing the design flaw, Nvidia has been producing Blackwell "at an incredible pace," Huang said. The chips were supposed to ship in the second quarter of this year, but are now shipping in the fourth quarter. Reports that Blackwell could be delayed ramped up in August, causing Nvidia shares to drop. Since then, the stock has climbed back up, growing over 188% year-to-date at the time of writing. Huang has previously said that intense demand was the one thing that kept him up at night and that everyone wanted to be the first to use the Blackwell chip. "We have a lot of people on our shoulders, and everybody is counting on us," he said last month. Snags in Blackwell production affect some of the world's biggest tech companies, which are Nvidia's biggest customers. Over 40% of Nvidia's revenue comes from just four clients: Amazon, Google, Meta, and Microsoft.
[8]
Nvidia CEO says a design flaw in its new AI chip was '100% Nvidia's fault'
The chipmaker worked with its partner, Taiwan Semiconductor Manufacturing Company, to resolve the engineering setback in its highly anticipated Blackwell AI platform. "It was functional, but the design flaw caused the yield to be low," Nvidia chief executive Jensen Huang said, according to Reuters. "It was 100% Nvidia's fault." The complexity of the project contributed to the setback. The Blackwell computer involved seven new chip designs that needed to be developed and put into production simultaneously, according to Huang. The problems first surfaced in August, when Nvidia's stock fell around 8% after a report that Blackwell's production was delayed due to a design flaw, possibly setting deliveries back by three or so months and impacting major customers such as Google and Microsoft. During the company's second-quarter earnings that month, Huang said Nvidia shipped samples of Blackwell to customers and that the AI platform's production would ramp up in the fourth quarter into the next fiscal year. To "improve production yield," Nvidia made a change to Blackwell's GPU mask, the chipmaker said during earnings. However, "there were no functional changes necessary," Huang said on a call with analysts. The company said it expects to "ship several billion dollars in Blackwell revenue" in the fourth quarter. "What TSMC did, was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace," Huang said during an appearance in Denmark to unveil the country's first AI supercomputer, Gefion. The recovery appears successful. In October, Nvidia shares got a boost after Huang said Blackwell was in full production and demand for the chip was "insane." "Everybody wants to have the most, and everybody wants to be first," Huang said during an appearance on CNBC.
Nvidia stock closed at a record high of $143.71 per share earlier this week ahead of major technology companies' earnings. The chipmaker's shares were down 2.9% during Wednesday morning trading but are up around 189% so far this year.
[9]
Nvidia's Huang Says Blackwell Problem Fixed
Nvidia CEO Jensen Huang said that a design problem with its new Blackwell AI chip had been fixed, adding that the flaw was "100% Nvidia's fault," Reuters reported. He described reports that the issue had caused tensions between Nvidia and TSMC, the Taiwanese chip giant that manufactures chips for Nvidia as "fake news." Huang appeared to be referring to a report in The Information last
NVIDIA CEO Jensen Huang admits to a design flaw in the company's latest Blackwell AI chips, which caused production delays. The issue has been resolved with TSMC's assistance, and mass production is back on schedule.
NVIDIA's CEO Jensen Huang has publicly addressed a significant design flaw in the company's latest Blackwell AI chips. "We had a design flaw in Blackwell," Huang stated at an event in Copenhagen. "It was functional, but the design flaw caused the yield to be low. It was 100% Nvidia's fault." [3] This admission comes after reports of production delays that potentially affected major customers such as Meta, Google, and Microsoft [1].
The Blackwell B100 and B200 GPUs utilize TSMC's CoWoS-L packaging technology, which employs an RDL interposer with local silicon interconnect bridges to achieve data transfer rates of about 10 TB/s. The flaw stemmed from a mismatch in thermal expansion properties between the GPU chiplets, the bridges, the interposer, and the motherboard substrate, which caused the system to warp and fail [2].
Contrary to initial reports suggesting tensions between NVIDIA and TSMC, Huang clarified that TSMC played a crucial role in resolving the issue. "What TSMC did was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace," Huang explained [1]. He dismissed claims of friction between the two companies as "fake news" [2].
Huang highlighted the complexity of the Blackwell project, stating, "In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time." [3] This complexity may have contributed to the occurrence of the design flaw.
To address the issue, NVIDIA modified the top metal layers and bumps of the GPU silicon, enhancing production yields. While specific details of the fix remain undisclosed, the company confirmed that new masks were required [1]. The speed of resolution was noteworthy, as each new chip stepping typically takes around three months to complete in the semiconductor industry [2].
With the design flaw now resolved, mass production of the fixed Blackwell GPUs is set to begin in late October. Shipments are expected to start in early 2025, which still falls within NVIDIA's fiscal year 2025 [1][2]. Demand for Blackwell chips remains high, with major tech companies placing significant orders. Google has ordered over 400,000 GB200 chips in a deal exceeding $10 billion, while Meta has placed a $10 billion order [1].
The successful resolution of this design flaw is crucial for NVIDIA as it aims to maintain its dominant position in the AI chip market. The Blackwell platform, described by Huang as "the world's most powerful chip," is set to be up to 30x faster than its Grace Hopper predecessor in AI inference tasks, while also reducing cost and energy consumption by up to 25x [3]. This advancement is expected to solidify NVIDIA's leadership in the rapidly growing AI computing solutions market.
© 2024 TheOutpost.AI All rights reserved