Curated by THEOUTPOST
On Tue, 29 Oct, 12:08 AM UTC
13 Sources
[1]
Elon Musk's xAI to double Colossus AI supercomputer power to 200K NVIDIA Hopper AI GPUs
Elon Musk's xAI startup is currently upgrading its Colossus AI supercomputer cluster from 100,000 NVIDIA Hopper AI GPUs, doubling it to an insane 200,000 NVIDIA Hopper AI GPUs. Colossus is the world's largest AI supercomputer, and is used to train xAI's Grok family of LLMs (large language models), with chatbots on offer for X Premium subscribers. Elon's massive xAI Colossus supercomputer facility was recently toured (more on that in the links below) and took just 122 days to complete, a feat that led NVIDIA CEO Jensen Huang to call Elon Musk "superhuman". NVIDIA recently posted some content explaining its partnership with Elon and xAI, with the company explaining: "The supporting facility and state-of-the-art supercomputer was built by xAI and NVIDIA in just 122 days, instead of the typical timeframe for systems of this size that can take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began". "While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput enabled by Spectrum-X congestion control. This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput". Gilad Shainer, senior vice president of networking at NVIDIA, explains: "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency. The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions". 
Elon Musk explained on X: "Colossus is the most powerful training system in the world. Nice work by xAI team, NVIDIA and our many partners/suppliers".
[2]
NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI
NVIDIA Spectrum-X Makes Colossal NVIDIA Hopper 100,000-GPU System Possible. NVIDIA today announced that xAI's Colossus supercomputer cluster, comprising 100,000 NVIDIA Hopper GPUs, achieved this massive scale by using the NVIDIA Spectrum-X Ethernet networking platform for its Remote Direct Memory Access (RDMA) network; the platform is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet. Colossus, the world's largest AI supercomputer, is being used to train xAI's Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs. "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at NVIDIA. "The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions." "Colossus is the most powerful training system in the world," said Elon Musk. "Nice work by the xAI team, NVIDIA and our many partners/suppliers." Some of the key highlights from the announcement include: The supporting facility and state-of-the-art supercomputer was built by xAI and NVIDIA in just 122 days, instead of the typical timeframe for systems of this size that can take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began.
[3]
Nvidia's Spectrum-X Ethernet to enable the world's largest AI supercomputer -- 200,000 Hopper GPUs
One of the challenges with building high-end AI data centers is connecting servers and making tens of thousands of GPUs work in concert and without problems, making network interconnections as important as GPUs. To build xAI's Colossus supercomputer, which now has 100,000 of Nvidia's Hopper processors and will expand to 200,000 H100 and H200 GPUs in the coming months, the company chose Nvidia's Spectrum-X Ethernet. Nvidia's Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which enables port speeds up to 800 Gb/s and is built on the Spectrum-4 switch ASIC. The network platform works with Nvidia's BlueField-3 SuperNICs to deliver exceptional speed and efficiency when transferring massive data flows required for AI training. With Spectrum-X, Colossus achieves consistently high data throughput (95%) and virtually eliminates network latency issues and packet loss, allowing seamless operation at an unprecedented scale. The green company says that traditional Ethernet would struggle to handle such a scale, often experiencing heavy congestion and low data throughput. By contrast, Spectrum-X's adaptive routing, congestion control, and performance isolation technologies tackle these issues, ensuring a stable, high-performance environment. "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at Nvidia. "The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions." Even with 100,000 Hopper GPUs, xAI's Colossus is one of the world's most powerful supercomputers for AI training. Yet, it was constructed in just 122 days, and its rapid deployment contrasts sharply with typical timelines for such massive systems, which often span months or even years. 
This efficiency extended to its operational setup, where training commenced 19 days after the first hardware was delivered and installed. It remains to be seen how long it will take xAI to install 100,000 more Hopper GPUs, though it is safe to say that for a while, this will be the world's most powerful AI supercomputer, at least until Microsoft and Oracle deploy their Blackwell-based systems. "Colossus is the most powerful training system in the world," said Elon Musk on X. "Nice work by xAI team, NVIDIA and our many partners/suppliers."
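The throughput figures repeated across these reports (95% with Spectrum-X versus roughly 60% with standard Ethernet at this scale) imply a large gap in usable bandwidth. A back-of-envelope sketch using the articles' percentages, with the SN5600's 800 Gb/s rated port speed taken as an assumed link rate for illustration:

```python
# Back-of-envelope comparison of effective network bandwidth.
# Efficiency percentages are from the article; the 800 Gb/s figure is the
# SN5600's rated maximum port speed, used here as an assumed link rate.
PORT_SPEED_GBPS = 800
SPECTRUM_X_EFFICIENCY = 0.95   # throughput Nvidia reports with Spectrum-X congestion control
STANDARD_ETH_EFFICIENCY = 0.60 # throughput Nvidia attributes to standard Ethernet at scale

spectrum_x_effective = PORT_SPEED_GBPS * SPECTRUM_X_EFFICIENCY  # ~760 Gb/s usable
standard_effective = PORT_SPEED_GBPS * STANDARD_ETH_EFFICIENCY  # ~480 Gb/s usable

# All else equal, communication-bound training phases would take ~58% longer
# on standard Ethernet, since the same bytes move over a slower effective link.
slowdown = spectrum_x_effective / standard_effective
print(f"Effective: {spectrum_x_effective:.0f} vs {standard_effective:.0f} Gb/s "
      f"({slowdown:.2f}x slower comms on standard Ethernet)")
```

The point of the comparison is that the raw port speed is the same in both cases; only the fraction of it that survives congestion differs.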
[4]
Elon Musk's supercomputer with 100,000 Nvidia GPUs uses proprietary Spectrum-X networking platform
In brief: Elon Musk's wild foray into the AI business has resulted in the construction of a massive supercomputer in record time. Curiously, Nvidia notes that this supersystem doesn't utilize the traditional InfiniBand networking standard to transfer data as one might expect. The high-performance computing system built by xAI, featuring 100,000 Hopper GPUs, is named Colossus. The system utilizes the company's Spectrum-X networking platform instead of InfiniBand, a technology Nvidia brought in-house through its 2019 acquisition of Mellanox, the last independent supplier. Nvidia stated that the designers of Colossus achieved the system's massive scale largely thanks to Spectrum-X. This technology significantly improves remote direct memory access network performance while utilizing "standards-based" Ethernet communication devices. Colossus was constructed in record time, and the xAI team is now in the process of doubling its capacity by installing an additional 100,000 Hopper GPUs. Standard Ethernet devices are insufficient for Colossus, as they can cause thousands of flow collisions and deliver a meager 60 percent data throughput. In contrast, Spectrum-X guarantees "zero application latency degradation" and eliminates packet loss due to flow collisions, maintaining a significantly higher 95 percent data throughput through its "congestion control" system. Colossus is training large language models belonging to the Grok family and requires "unprecedented" network performance to do so. Spectrum-X isn't your run-of-the-mill Ethernet technology. The core of the platform is the Spectrum SN5600 Ethernet switch, which Nvidia claims can support up to 800 Gbps per single port. This switch is built on a Spectrum-4 custom ASIC, and xAI has paired it with Nvidia BlueField-3 SuperNICs to effectively accelerate GPU-to-GPU communication. InfiniBand was specifically designed to meet the communication needs of HPC systems, keeping packet loss to an absolute minimum. 
While Ethernet has a significantly higher rate of data loss, it remains extremely popular - even in the speed-sensitive HPC market - due to factors such as high compatibility, vendor choice, and potentially higher bandwidth capabilities per single port. Nvidia stated that its Spectrum-X Ethernet networking platform can accelerate the development of powerful AI systems like Colossus, reducing the time needed to bring massive HPC machines online. Spectrum-X technology is scalable and can potentially provide networking features that were previously available only through InfiniBand solutions.
[5]
NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI
NVIDIA Spectrum-X Makes Colossal NVIDIA Hopper 100,000-GPU System Possible. NVIDIA today announced that xAI's Colossus supercomputer cluster comprising 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform, which is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network. Colossus, the world's largest AI supercomputer, is being used to train xAI's Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs. The supporting facility and state-of-the-art supercomputer was built by xAI and NVIDIA in just 122 days, instead of the typical timeframe for systems of this size that can take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began. While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput enabled by Spectrum-X congestion control. This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput. "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at NVIDIA. "The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions." 
"Colossus is the most powerful training system in the world," said Elon Musk on X. "Nice work by xAI team, NVIDIA and our many partners/suppliers." "xAI has built the world's largest, most-powerful supercomputer," said a spokesperson for xAI. "NVIDIA's Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive-scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard." At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs for unprecedented performance. Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation -- all key requirements for multi-tenant generative AI clouds and large enterprise environments.
[6]
Musk's xAI Taps NVIDIA to Expand Grok Using 'World's Largest AI Supercomputer' - Decrypt
Chipmaker Nvidia announced Monday that its Spectrum-X networking technology has helped expand startup xAI's Colossus supercomputer, now recognized as the largest AI training cluster in the world. Located in Memphis, Tennessee, Colossus serves as the training ground for the third generation of Grok, xAI's suite of large language models developed to power chatbot features for X Premium subscribers. Colossus, completed in just 122 days, began training its first models 19 days after installation. Tech billionaire Elon Musk's startup xAI plans to double the system's capacity to 200,000 GPUs, Nvidia said in a statement on Monday. At its core, Colossus is a giant interconnected system of GPUs, each specialized in processing large datasets. When Grok models are trained, they need to analyze enormous amounts of text, images, and data to improve their responses. Touted by Musk as the most powerful AI training cluster in the world, Colossus connects 100,000 NVIDIA Hopper GPUs using a unified Remote Direct Memory Access network. The system handles complex training tasks by splitting the workload across many Hopper GPUs and processing it in parallel. The RDMA architecture allows data to move directly between nodes, bypassing the operating system and ensuring low latency as well as optimal throughput for extensive AI training tasks. While traditional Ethernet networks often suffer from congestion and packet loss -- limiting throughput to 60% -- Spectrum-X achieves 95% throughput without latency degradation. Spectrum-X allows large numbers of GPUs to communicate more smoothly with one another, as traditional networks can get bogged down with too much data. The technology allows Grok to be trained faster and more accurately, which is essential for building AI models that respond effectively to human interactions. Monday's announcement had little effect on Nvidia's stock, which dipped slightly. Shares traded at $141 as of Monday, with the company's market cap at $3.45 trillion.
[7]
xAI's 100,000 H100 Colossus is glued together using Ethernet
Work is already underway to expand the system to 200,000 Nvidia Hopper chips. Unlike most AI training clusters, xAI's Colossus with its 100,000 Nvidia Hopper GPUs doesn't use InfiniBand. Instead, the massive system, which Nvidia bills as the "world's largest AI supercomputer," was built using the GPU giant's Spectrum-X Ethernet fabric. Colossus was built to train xAI's Grok series of large language models, which power the chatbot built into Elon Musk's echo chamber colloquially known as Tw..., right, X. The system as a whole is massive, boasting more than 2.5 times the number of GPUs compared to the US' number one ranked Frontier supercomputer at Oak Ridge National Laboratory with its nearly 38,000 AMD MI250X accelerators. Perhaps more impressively, Colossus was deployed in just 122 days and took 19 days to go from first deployment to training. In terms of peak performance, the xAI cluster boasts 98.9 exaFLOPS of dense FP16/BF16 -- double that if xAI's models can take advantage of sparsity during training, and double that again to 395 exaFLOPS when training at sparse FP8 precision. However, those performance figures won't last for long. Nvidia reports that xAI has already begun adding another 100,000 Hopper GPUs to the cluster, which would effectively double the system's performance. Even if xAI were to run the High Performance Linpack (HPL) used to rank the world's largest and most powerful publicly known supercomputers on the system, Colossus would almost certainly claim the top spot with 6.7 exaFLOPS of peak FP64 matrix performance. However, that assumes the Ethernet fabric used to stitch those GPUs together can keep up. There is a reason, after all, that HPC centers tend to opt for InfiniBand. Beyond Colossus' massive performance figures, it's worth talking about this networking choice. As we previously discussed, as of early 2024, about 90 percent of AI clusters used Nvidia's InfiniBand networking. The reason comes down to scale. 
Training large models requires distributing workloads across hundreds and even thousands of nodes. Any amount of packet loss can result in higher tail latencies and therefore slower time to train models. InfiniBand is designed to keep packet loss to an absolute minimum. On the other hand, packet loss is a fact of life in traditional Ethernet networks. Despite this, Ethernet remains attractive for a variety of reasons, including cross compatibility, vendor choice, and often higher per-port bandwidth. So, to overcome Ethernet's limitations, Nvidia developed its Spectrum-X family of products, which includes its Spectrum Ethernet switches and BlueField SuperNICs. Specifically, Colossus used the 51.2 Tbps Spectrum SN5600, which crams 64 800GbE ports into a 2U form factor. Meanwhile, the individual nodes used Nvidia's BlueField-3 SuperNICs, which feature a single 400GbE connection to each GPU in the cluster. But while Nvidia can't deliver 800 Gbps networking to each accelerator just yet, its next-generation ConnectX-8 SuperNICs will. The idea is that by building logic into both the switch and the NIC, the two can take advantage of high-speed packet reordering, advanced congestion control, and programmable I/O pathing to achieve InfiniBand-like loss and latencies over Ethernet. "Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions," Nvidia claimed in a recent blog post, adding that it has also managed to achieve 95 percent data throughput thanks to the fabric's congestion controls. For comparison, Nvidia argues that, at this scale, standard Ethernet would have created thousands of flow collisions and would have only achieved 60 percent of its data throughput. Nvidia isn't the only networking vendor looking to overcome Ethernet's limitations using SmartNICs and switches. 
As we've previously discussed, Broadcom is doing something quite similar, but rather than working at the NIC level, it's focusing primarily on reducing packet loss between its Jericho3-AI top-of-rack switches and its Tomahawk 5 aggregation switches. AMD is also getting in on the fun with its upcoming Ultra Ethernet-based Pensando Pollara 400, which will feature the same kind of packet spraying and congestion control tech we've seen from Nvidia, Broadcom, and others to achieve InfiniBand-like loss and latencies. ®
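The Register's peak-performance figures can be reproduced from widely published H100 SXM spec-sheet numbers (roughly 989 TFLOPS dense FP16/BF16 tensor throughput and 67 TFLOPS FP64 tensor-core throughput per GPU); a quick sanity check in Python:

```python
# Reproducing the article's peak-performance arithmetic from per-GPU
# H100 SXM spec-sheet numbers (approximate, as commonly published).
N_GPUS = 100_000
BF16_DENSE_TFLOPS = 989   # per H100: dense FP16/BF16 tensor-core throughput
FP64_MATRIX_TFLOPS = 67   # per H100: FP64 tensor-core throughput

dense_bf16_ef = N_GPUS * BF16_DENSE_TFLOPS / 1e6   # TFLOPS -> exaFLOPS
sparse_bf16_ef = dense_bf16_ef * 2                 # 2x with structured sparsity
sparse_fp8_ef = sparse_bf16_ef * 2                 # 2x again at FP8 precision
fp64_ef = N_GPUS * FP64_MATRIX_TFLOPS / 1e6

print(f"dense BF16:  {dense_bf16_ef:.1f} EF")   # 98.9 EF, as the article states
print(f"sparse BF16: {sparse_bf16_ef:.1f} EF")  # 197.8 EF
print(f"sparse FP8:  {sparse_fp8_ef:.1f} EF")   # ~395 EF, matching the article
print(f"FP64 matrix: {fp64_ef:.1f} EF")         # 6.7 EF peak for an HPL-style run
```

These are theoretical peaks; sustained HPL or training throughput would be lower and, as the article notes, would hinge on the Ethernet fabric keeping up.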
[8]
xAI Is Now In The Process Of Doubling The Size Of Its Colossus Supercluster To 200,000 NVIDIA Hopper GPUs
This is not investment advice. The author has no position in any of the stocks mentioned. Wccftech.com has a disclosure and ethics policy. Jensen Huang had termed Elon Musk "superhuman" when he described in a recent interview how xAI was able to bring together NVIDIA's gear and operationalize it within its own data center in just 19 days. Now, Musk appears determined to subdue his competitors by continuing to pursue a shock and awe campaign that will see xAI's Supercluster double in size. For the benefit of those who might not be aware, xAI's Colossus supercomputer cluster currently consists of 100,000 units of NVIDIA's liquid-cooled H100 GPUs. Dubbed the world's largest AI supercomputer, Colossus is right now training xAI's Grok family of large language models (LLMs). Now, NVIDIA has revealed in a dedicated press release that xAI is doubling the size of its Colossus supercluster: "xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs." Bear in mind that xAI and NVIDIA were able to bring Colossus online in just 122 days when it would ordinarily take "many months to years" to operationalize such an intricate system. What's more, xAI was able to commence the training of its Grok LLM within 19 days of the first H100 GPU rack rolling onto the floor of the AI gigafactory. NVIDIA goes on to note: "Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput enabled by Spectrum-X congestion control." Meanwhile, as we mentioned earlier, NVIDIA's CEO was quite effusive about Elon Musk in a recent interview (watch here), going so far as to term him a "superhuman" and "singular" in his understanding of engineering and construction: "... Just building a massive factory, liquid-cooled, energized, permitted in the short time that was done...I mean that is, like, superhuman. Yeah, there's. 
And, as far as I know, there's only one person in the world who could do that. You know, I mean, Elon is singular in this understanding of engineering and construction and large systems, and marshaling resources ..." Bear in mind that Morgan Stanley expects NVIDIA to sell around 1.5 million units of its Hopper GPUs in the fourth quarter of 2024, before ramping the sales down to 1 million units in the first quarter of 2025 as Blackwell volumes begin to soar.
[9]
First in-depth look at Elon Musk's 100,000 GPU AI cluster -- xAI Colossus reveals its secrets
Now, witness the firepower of this fully armed and operational AI supercluster. Elon Musk's expensive new project, the xAI Colossus AI supercomputer, has been detailed for the first time. YouTuber ServeTheHome was granted access to the Supermicro servers within the 100,000 GPU beast, showing off several facets of the supercomputer. Musk's xAI Colossus supercluster has been online for almost two months, after a 122-day assembly. Patrick from ServeTheHome takes a camera around several parts of the facility, providing an inside look at its operations. The finer details of the supercomputer, like its power draw and pump sizes, could not be revealed under a non-disclosure agreement, and xAI blurred and censored parts of the video before its release. The most important things, like the Supermicro GPU servers, were left mostly intact in the footage above. The GPU servers are Nvidia HGX H100s, a server solution containing eight H100 GPUs each. The HGX H100 platform is packaged inside Supermicro's 4U Universal GPU Liquid Cooled system, providing easy hot-swappable liquid cooling to each GPU. These servers are loaded inside racks which hold eight servers each, making 64 GPUs per rack. 1U manifolds are sandwiched between each HGX H100, providing the liquid cooling the servers need. At the bottom of each rack is another Supermicro 4U unit, this time with a redundant pump system and rack monitoring system. These racks are paired in groups of eight, making 512 GPUs per array. Each server has four redundant power supplies, with the rear of the GPU racks revealing 3-phase power supplies, Ethernet switches, and a rack-sized manifold providing all of the liquid cooling. There are over 1,500 GPU racks within the Colossus cluster, or close to 200 arrays of racks. According to Nvidia CEO Jensen Huang, the GPUs for these 200 arrays were fully installed in only three weeks. 
Because of the high-bandwidth requirements of an AI supercluster constantly training models, xAI went beyond overkill for its networking interconnectivity. Each graphics card has a dedicated NIC (network interface controller) at 400GbE, with an extra 400Gb NIC per server. This means that each HGX H100 server has 3.6 terabits per second of Ethernet bandwidth. And yes, the entire cluster runs on Ethernet, rather than InfiniBand or other exotic interconnects that are standard in the supercomputing space. Of course, a supercomputer built for training AI models like the Grok 3 chatbot needs more than just GPUs to function. Details on the storage and CPU compute servers in Colossus are more restricted. From what we can see in Patrick's video and blog post, these servers are also mostly in Supermicro chassis. Waves of NVMe-forward 1U servers with some kind of x86 platform CPU inside handle either storage or CPU compute, also with rear-entry liquid cooling. Outside, some heavily bundled banks of Tesla Megapack batteries are seen. The start-and-stop nature of AI training, with power draw swinging on millisecond timescales, was too much for the power grid or Musk's diesel generators to handle, so some number of Tesla Megapacks (holding up to 3.9 MWh each) are used as an energy buffer between the power grid and the supercomputer. The xAI Colossus supercomputer is currently, according to Nvidia, the largest AI supercomputer in the world. While many of the world's leading supercomputers are research systems usable by many contractors or academics for studying weather patterns, disease, or other difficult compute tasks, Colossus is solely responsible for training X's (formerly Twitter) various AI models, primarily Grok 3, Elon's "anti-woke" chatbot available only to X Premium subscribers. ServeTheHome was also told that Colossus is training AI models "of the future"; models whose uses and abilities are supposedly beyond the powers of today's flagship AI. 
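The rack and bandwidth figures in the tour can be cross-checked with quick arithmetic (counts as reported by ServeTheHome; the ~1,500-rack figure in the article is approximate):

```python
# Cross-checking ServeTheHome's reported rack topology and per-server bandwidth
GPUS_PER_SERVER = 8   # one Nvidia HGX H100 board per Supermicro 4U server
SERVERS_PER_RACK = 8
RACKS_PER_ARRAY = 8

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK   # 64, matching the article
gpus_per_array = gpus_per_rack * RACKS_PER_ARRAY     # 512, matching the article
arrays_for_100k = 100_000 / gpus_per_array           # ~195, i.e. "close to 200"

# Networking: one 400GbE NIC per GPU plus one extra 400GbE NIC per server
server_bandwidth_tbps = (GPUS_PER_SERVER + 1) * 400 / 1000  # 3.6 Tb/s per server

print(gpus_per_rack, gpus_per_array, round(arrays_for_100k), server_bandwidth_tbps)
```

The per-server total is how the article arrives at 3.6 Tb/s: nine 400GbE links (eight GPU-dedicated, one for the host) per HGX H100 chassis.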
Colossus's first phase of construction is complete and the cluster is fully online, but it's not all done. The Memphis supercomputer will soon be upgraded to double its GPU capacity, with 50,000 more H100 GPUs and 50,000 next-gen H200 GPUs. This will also more than double its power consumption, which is already too much for the 14 diesel generators Musk added to the site in July to handle. It also falls below Musk's promise of 300,000 H200s inside Colossus, though that may become phase 3 of upgrades. The 50,000-GPU Cortex supercomputer at Tesla's "Giga Texas" plant is another Musk-company project. Cortex is devoted to training Tesla's self-driving AI tech through camera feed and image detection alone, as well as Tesla's autonomous robots and other AI projects. Tesla is also set to build the $500 million Dojo supercomputer in Buffalo, New York. With industry speculators like Baidu CEO Robin Li predicting that 99% of AI companies will crumble when the bubble pops, it remains to be seen whether Musk's record-breaking AI spending will backfire or pay off.
[10]
Elon Musk is doubling the world's largest AI GPU cluster -- expanding Colossus GPU cluster to 200,000 'soon,' has floated 300,000 in the past
xAI Colossus AI supercomputer continues to grow at a very fast pace. Billionaire Elon Musk has taken to Twitter / X to boast that his remarkable xAI data center is set to double its firepower "soon." He was commenting on the recent video exposé of his xAI Colossus AI supercomputer. In the highlighted video, TechTuber ServeTheHome was stunned when he saw the gleaming rows of Supermicro servers packed with 100,000 state-of-the-art Nvidia enterprise GPUs. So, the xAI Colossus AI supercomputer is on course "Soon to become a 200k H100/H200 training cluster in a single building." Its 100,000 GPU incarnation, which only just started AI training about two weeks ago, was already notable. We think "soon" might indeed be soon in this case; however, Musk's prior tech timing slippages (e.g., Tesla's full self-driving, Hyperloop delays, SolarCity struggles) mean we should be generally cautious about his forward-looking boasts. The xAI Colossus has already been dubbed an engineering marvel. Importantly, praise for the supercomputer's prowess isn't limited to the usual Musk toadies. Nvidia CEO Jensen Huang also described this supercomputer project as a "superhuman" feat that had "never been done before." xAI engineers must have worked very hard and long hours to set up the xAI Colossus AI supercomputer in 19 days. Typically, projects of this scale and complexity can take up to four years to get running, indicated Huang. What will the 200,000 H100/H200 GPUs be used for? This very considerable computing resource will probably not be tasked with making scientific breakthroughs for the benefit of mankind. Instead, the 200,000 power-hungry GPUs are likely destined to train AI models and chatbots like Grok 3, ramping up the potency of its machine-learning-distilled 'anti-woke' retorts. This isn't the endgame for xAI Colossus hardware expansion, far from it. Musk previously touted a Colossus packing 300,000 Nvidia H200 GPUs throbbing within. 
At the current pace of upgrades, we could even see Musk Tweeting about reaching this 300,000 goal before 2024 is out. Perhaps, if anything delays 'Grok 300,000,' it could be factors outside of Musk's control, like GPU supplies. We have also previously reported that on-site power generation had to be beefed up to cope even with stage 1 of xAI's Colossus, so that's another hurdle - alongside complex liquid cooling and networking hardware.
[11]
World's 1st AI ethernet by Nvidia powers Elon Musk's Colossus supercomputer
Elon Musk recently stated that xAI's Colossus supercomputer is the "most powerful AI training system in the world." US chipmaker Nvidia announced on Monday, October 28, that it has helped Elon Musk's xAI expand its Colossus supercomputer. The Colossus supercomputer cluster is now recognized as the largest AI training cluster in the world. Thanks partly to Nvidia's Spectrum-X Ethernet networking technology, xAI can take its ChatGPT-rivaling Grok AI to new levels. Founded by Elon Musk last year, xAI is a startup that provides a service similar to OpenAI's ChatGPT. In a move typical of Musk, the company has a grandiose mission that strikes at the core of our existence. That goal, the company says, is to use generative artificial intelligence "to understand the true nature of the universe."
[12]
Elon Musk Prepares to Double xAI Supercomputer to 200,000 Nvidia GPUs
In the race to build next-generation AI, 100,000 enterprise-grade GPUs aren't enough for Elon Musk. His xAI startup is already preparing to expand a supercomputer in Memphis, Tennessee, to 200,000 GPUs. Nvidia spilled the news on Monday, revealing that xAI's "Colossus" supercomputer is in the process of doubling its size. Musk also tweeted that the supercomputer is close to incorporating 200,000 H100 and H200 Nvidia GPUs inside a 785,000-square-foot building. Musk's supercomputer in Memphis is noteworthy for how quickly his startup assembled the GPUs into a working AI training cluster. "From start to finish, it was done in 122 days," Musk has said. Supercomputers usually take years to build. His company also likely paid at least $3 billion to assemble the supercomputer since it's currently made up of 100,000 Nvidia H100 GPUs, which usually cost around $30,000 apiece. Musk now wants to upgrade the facility with H200 GPUs, which feature more memory but cost closer to $40,000 per unit. As a result, Musk will need to fork over billions more in addition to paying the electricity costs for the supercomputer. His ultimate goal is to expand the Colossus supercomputer to 300,000 Nvidia Blackwell B200 GPUs by next summer. Musk is betting big on Nvidia's GPU technology to help him improve xAI's Grok chatbot and other AI technologies. On Monday, ServeTheHome posted a video from inside the Colossus supercomputing facility, which contains numerous server racks of Nvidia GPUs. Musk isn't alone in buying GPUs to train next-generation AI. Meta, OpenAI, and Microsoft have also been acquiring Nvidia's technologies, including Blackwell GPUs.
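The cost figures follow directly from the article's quoted per-unit street prices; a rough sketch (the prices are the article's estimates, not official Nvidia figures):

```python
# Rough hardware-cost arithmetic using the article's estimated street prices
H100_PRICE = 30_000   # USD per H100 (article's estimate)
H200_PRICE = 40_000   # USD per H200 (article's estimate)

current_cost = 100_000 * H100_PRICE    # ~$3B for the existing 100k-GPU cluster
expansion_cost = 100_000 * H200_PRICE  # ~$4B more if the next 100k are H200s

print(f"existing 100k H100s: ${current_cost / 1e9:.1f}B")
print(f"+100k H200s:         ${expansion_cost / 1e9:.1f}B")
```

This covers GPUs alone; networking, facility construction, Megapack buffering, and electricity would add substantially on top.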
[13]
NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI - NVIDIA (NASDAQ:NVDA)
SANTA CLARA, Calif., Oct. 28, 2024 (GLOBE NEWSWIRE) -- NVIDIA today announced that xAI's Colossus supercomputer cluster comprising 100,000 NVIDIA Hopper Tensor Core GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform, which is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network. Colossus, the world's largest AI supercomputer, is being used to train xAI's Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs. The supporting facility and state-of-the-art supercomputer was built by xAI and NVIDIA in just 122 days, instead of the typical timeframe for systems of this size that can take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began. While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput enabled by Spectrum-X congestion control. This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput. "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at NVIDIA. "The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions." 
"Colossus is the most powerful training system in the world," said Elon Musk on X. "Nice work by xAI team, NVIDIA and our many partners/suppliers."

"xAI has built the world's largest, most-powerful supercomputer," said a spokesperson for xAI. "NVIDIA's Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive-scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard."

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3® SuperNICs for unprecedented performance.

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, and enhanced AI fabric visibility and performance isolation -- all key requirements for multi-tenant generative AI clouds and large enterprise environments.

About NVIDIA

NVIDIA (NASDAQ: NVDA) is the world leader in accelerated computing.
Elon Musk's xAI is expanding its Colossus AI supercomputer from 100,000 to 200,000 NVIDIA Hopper GPUs, making it the world's largest AI training system. The project showcases NVIDIA's Spectrum-X Ethernet networking platform, achieving unprecedented performance in AI workloads.
Elon Musk's artificial intelligence company, xAI, is in the process of doubling the capacity of its Colossus supercomputer cluster from 100,000 to an impressive 200,000 NVIDIA Hopper GPUs. This expansion will solidify Colossus's position as the world's largest AI supercomputer, primarily used for training xAI's Grok family of large language models.
The Colossus facility, located in Memphis, Tennessee, was built by xAI and NVIDIA in a remarkably short timeframe of just 122 days. This rapid deployment stands in stark contrast to typical timelines for systems of this scale, which often take months or even years to complete. NVIDIA CEO Jensen Huang praised Elon Musk as "superhuman" for this achievement.
At the heart of Colossus's exceptional performance is NVIDIA's Spectrum-X Ethernet networking platform. This advanced technology enables the supercomputer to achieve unprecedented network performance, maintaining 95% data throughput and experiencing zero application latency degradation or packet loss due to flow collisions across all three tiers of the network fabric.
The Spectrum-X platform is built around the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI has paired this switch with NVIDIA BlueField-3 SuperNICs to maximize performance. This combination delivers superior efficiency in transferring the massive data flows required for AI training.
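To put the quoted port speed in perspective, here is a back-of-envelope sketch. Only the 800 Gb/s figure comes from the announcement; the 500 GB checkpoint size is a purely illustrative assumption:

```python
# Back-of-envelope math for one 800 Gb/s SN5600 switch port.
# Only the 800 Gb/s port speed comes from the announcement;
# the checkpoint size below is a purely illustrative assumption.

PORT_SPEED_GBPS = 800                          # gigabits per second

bytes_per_second = PORT_SPEED_GBPS * 1e9 / 8   # gigabits -> bytes
print(f"Raw port bandwidth: {bytes_per_second / 1e9:.0f} GB/s")  # 100 GB/s

# Hypothetical: moving a 500 GB model checkpoint over one port
# at line rate, ignoring protocol overhead.
checkpoint_bytes = 500 * 1e9
print(f"500 GB at line rate: {checkpoint_bytes / bytes_per_second:.1f} s")  # 5.0 s
```

At 100 GB/s per port, even very large artifacts cross a single link in seconds; at cluster scale, the fabric must sustain many thousands of such flows simultaneously.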
Spectrum-X's performance significantly outpaces standard Ethernet solutions, which typically create thousands of flow collisions and deliver only 60% data throughput. The platform incorporates advanced features such as adaptive routing, congestion control, and performance isolation, ensuring a stable, high-performance environment for AI workloads.
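The gap between the two throughput figures can be sketched with simple arithmetic. The 95% and 60% efficiencies are NVIDIA's numbers; the 400 Gb/s per-link speed is an illustrative assumption:

```python
# Effective-throughput comparison using NVIDIA's quoted efficiencies:
# 95% with Spectrum-X congestion control vs. 60% on standard Ethernet.
# The 400 Gb/s per-link speed is an illustrative assumption.

LINK_GBPS = 400

spectrum_x = LINK_GBPS * 0.95   # effective Gb/s with Spectrum-X
standard = LINK_GBPS * 0.60     # effective Gb/s with standard Ethernet

print(f"Spectrum-X effective link rate: {spectrum_x:.0f} Gb/s")         # 380 Gb/s
print(f"Standard Ethernet:              {standard:.0f} Gb/s")           # 240 Gb/s
print(f"Relative advantage:             {spectrum_x / standard:.2f}x")  # 1.58x
```

Whatever the absolute link speed, the ratio holds: 95% versus 60% efficiency means roughly 1.58x more usable bandwidth per link, which compounds across a fabric connecting 100,000 GPUs.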
Gilad Shainer, senior vice president of networking at NVIDIA, emphasized the critical role of enhanced networking in AI development: "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency. The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions."
The expansion of Colossus and the implementation of Spectrum-X technology demonstrate the rapid advancements in AI infrastructure. This development is likely to accelerate the creation and deployment of more sophisticated AI models, potentially revolutionizing various industries and applications. As Elon Musk stated on X (formerly Twitter), "Colossus is the most powerful training system in the world," highlighting the significance of this achievement in the field of artificial intelligence.
Elon Musk's xAI has launched Colossus, a groundbreaking AI training system utilizing 100,000 NVIDIA H100 GPUs. This massive computational power aims to revolutionize AI development and compete with industry giants.
10 Sources
Elon Musk's xAI introduces Colossus, the world's most powerful AI training system. While impressive, questions arise about its storage capacity, power usage, and naming convention.
2 Sources
Elon Musk's AI startup xAI is set to dramatically expand its Colossus supercomputer in Memphis, Tennessee, aiming to reach over 1 million GPUs. This ambitious project involves partnerships with major tech companies and significant infrastructure challenges.
10 Sources
Nvidia CEO Jensen Huang lauds Elon Musk and xAI for constructing a supercomputer with 100,000 GPUs in just 19 days, a feat that typically takes years to accomplish.
6 Sources
Elon Musk's AI company, xAI, has brought online a powerful new supercomputer in Memphis to train its next-generation AI model, Grok 3. The system boasts an impressive array of 100,000 NVIDIA H100 GPUs, positioning it as one of the most potent AI training clusters globally.
11 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved