Curated by THEOUTPOST
On Wed, 4 Sept, 8:02 AM UTC
10 Sources
[1]
Elon Musk teases Colossus: most powerful AI training system uses 100,000 NVIDIA H100 AI GPUs
Elon Musk has announced a major milestone for his AI startup, xAI, which brought its new AI training system "Colossus" online over the weekend. Musk tweeted: "This weekend, the xAI team brought our Colossus 100K H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months. Excellent work by the team, NVIDIA and our many partners/suppliers".

Colossus is home to 100,000 of NVIDIA's current-gen Hopper H100 AI GPUs, and Musk says that soon the most powerful AI training system in the world will also pack 50,000 of NVIDIA's beefed-up H200 AI GPUs, which carry more, and faster, HBM3E memory than the H100. xAI's flagship LLM, Grok 2, was trained on 15,000 AI GPUs, so with Colossus giving access to 100,000+ AI GPUs, we could see next-generation large language models with far better capabilities unleashed.

Elon Musk himself said back in April 2024 that training Grok 3 would require 100,000 NVIDIA H100 AI GPUs, and just 5 months later we're here, with 100,000 NVIDIA H100 AI GPUs fired up and training away. NVIDIA's new Hopper H200 AI GPUs have up to 141GB of faster HBM3E memory, while the H100 has up to 80GB of HBM3 memory. Elon Musk and the xAI team are surely having a field day with this immense amount of AI training power.
[2]
xAI launches 'Colossus' AI training system with 100,000 Nvidia chips - SiliconANGLE
Elon Musk's xAI Corp. has completed the assembly of an artificial intelligence training system that features 100,000 graphics cards. Musk announced the milestone in a Monday post on X. The system, which xAI calls Colossus, came online over the weekend.

Musk launched xAI last year to compete with OpenAI, which he is currently suing for alleged breach of contract. The startup develops a line of large language models called Grok. In May, xAI raised $6 billion at a $24 billion valuation to finance its AI development efforts.

In this week's X post, Musk described the newly launched Colossus as the "most powerful AI training system in the world." That suggests the cluster is faster than the U.S. Energy Department's Aurora system, which ranks as the world's fastest AI supercomputer. In a May benchmark test, Aurora reached a top speed of 10.6 exaflops with 87% of its hardware active.

Musk detailed that Colossus is equipped with 100,000 of Nvidia's H100 graphics cards. The H100 debuted in 2022 and ranked as the chipmaker's most powerful AI processor for more than a year. It can run language models up to 30 times faster than Nvidia's previous-generation GPUs. One contributor to the H100's performance is its so-called Transformer Engine, a set of circuits optimized to run AI models based on the Transformer neural network architecture, which underpins GPT-4o, Meta Platforms Inc.'s Llama 3.1 405B and many other frontier LLMs.

Musk added that xAI plans to double Colossus' chip count to 200,000 within a few months, with 50,000 of the new processors being H200s. The H200 is an upgraded, significantly faster version of the H100 that Nvidia debuted last November. AI models shuffle information between the logic circuits of the chip on which they run and its memory more often than many other workloads do, so accelerating the movement of data between the memory and logic modules can boost AI models' performance. The H200 carries out such data transfers significantly faster than the H100.

The H200's speed advantage is the result of two architectural upgrades. First, Nvidia swapped the H100's HBM3 memory for a newer type of RAM called HBM3e that facilitates faster data transfers to and from the chip's logic circuits. Second, the company nearly doubled the onboard memory capacity to 141 gigabytes, which allows the H200 to keep more of an AI model's data near its logic circuits.

Grok-2, xAI's flagship LLM, was trained on 15,000 GPUs. Colossus' 100,000 chips could facilitate the development of language models with significantly better capabilities. The company reportedly hopes to release the successor to Grok-2 by year's end.
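To make the memory argument concrete, here is a minimal back-of-envelope sketch in Python of how the two chips' published HBM bandwidth figures (3.35 TB/s for the H100, 4.8 TB/s for the H200, per Nvidia's spec sheets) bound how fast resident model data can be read. The 70 GB shard size is an illustrative assumption, not an xAI figure.

```python
# Back-of-envelope: lower bound on the time to stream a model shard once
# from HBM on each GPU. Bandwidth figures are Nvidia's published peaks;
# the shard size is an illustrative assumption.

SPECS = {
    "H100": {"memory_gb": 80,  "bandwidth_tb_s": 3.35},  # HBM3
    "H200": {"memory_gb": 141, "bandwidth_tb_s": 4.8},   # HBM3e
}

def min_read_time_ms(num_bytes: float, bandwidth_tb_s: float) -> float:
    """Minimum time to read num_bytes from HBM once, in milliseconds."""
    return num_bytes / (bandwidth_tb_s * 1e12) * 1e3

shard_bytes = 70e9  # assumed per-GPU slice of a sharded model's weights
for name, spec in SPECS.items():
    print(f"{name}: {spec['memory_gb']} GB HBM, "
          f">= {min_read_time_ms(shard_bytes, spec['bandwidth_tb_s']):.1f} ms "
          f"per full pass over a 70 GB shard")
```

For memory-bound phases of an LLM workload, that roughly 1.4x bandwidth gap translates fairly directly into throughput, which is why the HBM3e upgrade matters alongside the larger capacity.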
[3]
Colossus: NVIDIA gave the world's most powerful AI training system to Elon Musk
Colossus is a groundbreaking artificial intelligence (AI) training system developed by Elon Musk's xAI Corp. The supercomputer, described by Musk as the "most powerful AI training system in the world," is a critical component of xAI's strategy to lead in the rapidly advancing field of AI.

At the core of Colossus are 100,000 NVIDIA H100 graphics cards, GPUs specifically designed to handle the demanding computational requirements of AI training. Musk has ambitious plans to expand Colossus further, aiming to double the system's GPU count to 200,000 in the near future. This expansion will include 50,000 units of NVIDIA's H200, an even more powerful successor to the H100 with more, and faster, onboard memory.

Colossus is specifically designed to train large language models (LLMs), the foundation of advanced AI applications. The sheer number of GPUs allows xAI to train AI models at a scale and speed unmatched by other systems. For example, xAI's current flagship LLM, Grok-2, was trained on 15,000 GPUs; with 100,000 GPUs now available, xAI can train much larger and more complex models, potentially leading to significant improvements in AI capabilities. The high memory capacity and rapid data-transfer capabilities of the H100 and H200 mean that even the most complex AI models can be trained more efficiently.

Colossus is not just a technical achievement; it's a strategic asset in xAI's mission to dominate the AI industry. By building the world's most powerful AI training system, xAI positions itself as a leader in developing cutting-edge AI models and gains a competitive advantage over other AI companies, including OpenAI, with which Musk is currently in legal conflict. Moreover, the construction of Colossus reflects Musk's broader vision for AI: by reallocating resources from Tesla to xAI, including the rerouting of 12,000 H100 GPUs worth over $500 million, Musk demonstrates his commitment to AI as a central focus of his business empire.
[4]
xAI brings Colossus, world's 'most powerful AI training system,' online: Musk
xAI, the artificial intelligence startup founded by Tesla (NASDAQ:TSLA) CEO Elon Musk, brought its massive AI training system, Colossus, online over the weekend. The AI training cluster is powered by 100,000 of Nvidia's (NASDAQ:NVDA) H100 GPUs, Musk said in a post on X, and took 122 days to build.

"Colossus is the most powerful AI training system in the world," Musk said. "Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent work by the team, Nvidia and our many partners/suppliers."

The battle for AI supremacy has escalated demand for Nvidia's coveted processors. "We have introduced Grok-2, positioning us at the forefront of AI development," xAI said in a blog post. "Our focus is on advancing core reasoning capabilities with our new compute cluster. We will have many more developments to share in the coming months."
[5]
xAI Colossus supercomputer with 100K H100 GPUs comes online -- Musk lays out plans to double GPU count to 200K with 50K H100 and 50K H200
Elon Musk's xAI has brought the world's most powerful AI training system online. The Colossus supercomputer uses as many as 100,000 Nvidia H100 GPUs for training and is set to expand with another 50,000 Nvidia H100 and H200 GPUs in the coming months.

"This weekend, the xAI team brought our Colossus 100K H100 training cluster online," Elon Musk wrote in an X post. "From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months."

According to Michael Dell, head of the eponymous high-tech giant, Dell developed and assembled the Colossus system quickly, which highlights the considerable experience the server maker has accumulated deploying AI servers during the last few years' AI boom.

Elon Musk and his companies have been busy making supercomputer-related announcements recently. In late August, Tesla announced its Cortex AI cluster featuring 50,000 Nvidia H100 GPUs and 20,000 of Tesla's Dojo AI wafer-sized chips. Even before that, in late July, xAI kicked off AI training on the Memphis Supercluster, comprising 100,000 liquid-cooled H100 GPUs. This supercomputer has to consume at least 150 MW of power, as the 100,000 H100 GPUs alone consume around 70 MW.

Although all of these clusters are formally operational and even training AI models, it is entirely unclear how many GPUs are actually online today. First, it takes some time to debug and optimize the settings of such superclusters. Second, xAI needs to ensure that they get enough power: while Elon Musk's company has been using 14 diesel generators to power its Memphis supercomputer, they were still not enough to feed all 100,000 H100 GPUs.

xAI's training of the Grok 2 large language model (LLM) required up to 20,000 Nvidia H100 GPUs, and Musk predicted that future versions, such as Grok 3, will need even more resources, potentially around 100,000 Nvidia H100 processors. To that end, xAI needs its vast data centers to train Grok 3 and then run inference on the model.
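The power numbers above are simple arithmetic; here is a minimal sketch of the same estimate, assuming the 700 W maximum TDP Nvidia publishes for the SXM H100 and an overhead multiplier (host systems, networking, cooling) that is an assumption rather than an xAI disclosure.

```python
# Rough cluster power estimate, mirroring the article's arithmetic.
# 700 W is Nvidia's published max TDP for the SXM H100; the overhead
# multiplier for hosts, networking and cooling is an assumption.

gpu_count = 100_000
gpu_tdp_w = 700

gpu_power_mw = gpu_count * gpu_tdp_w / 1e6
print(f"GPUs alone: ~{gpu_power_mw:.0f} MW")       # ~70 MW, as stated above

overhead = 2.1  # assumed facility multiplier
print(f"Facility estimate: ~{gpu_power_mw * overhead:.0f} MW")  # ~150 MW
```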
[6]
Elon Musk's xAI Launches Colossus Training Cluster
Elon Musk's artificial intelligence startup xAI launched its Colossus 100k H100 training cluster over the weekend. "Colossus is the most powerful AI training system in the world," Musk said in a post on social platform X. "Moreover, it will double in size to 200k (50k H200s) in a few months." The H200 is an Nvidia GPU designed to accelerate generative AI and large language models, according to the chip maker's website.

In a repost of Musk's post, Nvidia Data Center said on X that xAI's Colossus is "the world's largest GPU #supercomputer" and that it came online "in record time." "Colossus is powered by @nvidia's #acceleratedcomputing platform, delivering breakthrough performance with exceptional gains in #energyefficiency," the company added.

The news came about three months after xAI raised $6 billion in a Series B funding round, saying it would use the money to take its first products to market, build advanced infrastructure and accelerate its research and development. When announcing the round May 27, xAI said that in the time since Musk announced the formation of the company in July 2023, it had launched its AI chatbot Grok-1, the Grok-1.5 model with long-context capability, the Grok-1.5v model with image understanding, and the open-source release of Grok-1. "XAI will continue on this steep trajectory of progress over the coming months, with multiple exciting technology updates and products soon to be announced," the company said at the time.

Musk launched xAI after hinting for months that he wanted to build an alternative to OpenAI's AI-powered chatbot, ChatGPT. He was involved in the creation of that company but left its board in 2018 and became increasingly critical of OpenAI and cautious about developments around AI in general. During xAI's July 2023 Twitter Spaces introduction to the general public, Musk said that while he sees xAI in direct competition with larger businesses and upstarts in the AI space, his firm would take a different approach to establishing its foundation model. While "xAI is not trying to solve [artificial general intelligence] on a laptop, [and] there will be heavy compute," his team will have free rein to explore ideas other than scaling up the foundational model's data parameters.
[7]
Most powerful AI training system in the world goes online | Digital Trends
The race for AI supremacy is once again accelerating as xAI CEO Elon Musk announced via X that his company successfully brought its Colossus AI training cluster, which Musk bills as the world's "most powerful," online over the weekend.

"This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent work by the team, Nvidia and our many partners/suppliers," Musk wrote in a post on X.

Musk's "most powerful" claim is based on the number of GPUs employed by the system. With 100,000 Nvidia H100s driving it, Colossus is estimated to be larger than any other AI system developed to date. Musk began purchasing tens of thousands of GPUs in April 2023 to accelerate his company's AI efforts, shortly after signing an open letter calling for an industrywide, six-month "pause" on AI development. In March of that year, Musk claimed that the company would leverage AI to "detect & highlight manipulation of public opinion" on Twitter, though the GPU supercomputer will likely also be leveraged to train its large language model (LLM), Grok.

Grok was introduced by xAI in 2023 in response to the success of rivals like ChatGPT, Gemini, Llama 3.1, and Claude. The company released the updated Grok-2 as a beta in August. "We have introduced Grok-2, positioning us at the forefront of AI development," xAI wrote in a recent blog post. "Our focus is on advancing core reasoning capabilities with our new compute cluster. We will have many more developments to share in the coming months."

Musk claims that he can also develop Tesla into "a leader in AI & robotics"; however, a recent report from CNBC suggests that Musk has been diverting shipments of Nvidia's highly sought-after GPUs from the electric automaker to xAI and Twitter. Doing so could delay Tesla's efforts to install the compute resources needed to develop its autonomous vehicle technology and the Optimus humanoid robot. "Elon prioritizing X H100 GPU cluster deployment at X versus Tesla by redirecting 12k of shipped H100 GPUs originally slated for Tesla to X instead," an Nvidia memo from December obtained by CNBC reads. "In exchange, original X orders of 12k H100 slated for [January] and June to be redirected to Tesla."
[8]
Elon Musk's monster wakes up as xAI turns on 'Colossus', the Nvidia-powered AI-training supercomputer claiming to be the most powerful in the world
Rome wasn't built in a day, they say. Okay, but it still only took Elon Musk just 122 of 'em to tool up what is claimed to be the most powerful AI training system on the planet. Everyone's favourite billionaire misanthrope doesn't hang about, then.

Musk's new toy, dubbed Colossus and built by his AI startup, xAI, has been created to train the latest version of the Grok language model, known as Grok-3. It's powered by no fewer than 100,000 Nvidia H100 GPUs. If that's not enough for you, in an X post Musk says Colossus will double in power "in a few months" thanks to the addition of another 50,000 H200 Nvidia chips, each packing substantially more AI performance than an H100 GPU thanks to bigger, faster memory.

It's not clear how much this is all costing Musk and xAI. Estimates of pricing for Nvidia's H100 GPUs vary from $20,000 to as much as $90,000 a pop. Presumably, Musk managed to get a comparatively decent deal buying 100,000 of the things in one go. But even at the lower estimate, you're looking at $2 billion for the Nvidia chips for phase one, let alone building the datacenter, all the relevant infrastructure, staffing up, and doing all the work involved in setting up training for an advanced LLM. Oh, and whatever those other 50,000 H200s are costing on top as a little light frosting.

Indeed, it was only a few weeks ago that xAI launched Grok-2 as an exclusive-access thing for X subscribers. Grok-2 apparently made do with a piffling 15,000 H100 chips for training, the poor deluded little AI dear. And yet by some measures, Grok-2 ranks second only to ChatGPT-4o in the LLM league tables. So even the first phase of Colossus will be six to seven times more powerful than the cluster that trained Grok-2, and it is supposed to double in power a few months later. Clearly, Musk has his sights set on building the most powerful LLM out there. As for when Grok-3 might be unleashed, Musk told conservative polemicist and latterly podcaster Jordan Peterson just last month that he hoped Grok-3 would go live by December.

Incidentally, such a machine doesn't come without collateral consequences. The new cluster, located in Memphis, Tennessee, will chew through 150 megawatts of power and has been allocated up to one million gallons of water a day for cooling. So, add environmental impact to the roster of reasons to be unnerved by Colossus, alongside wider concerns about the direct impact of AI and Musk's ever-increasing volatility. That's plenty to be getting on with.
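A quick sketch of the cost range implied above, using the article's own $20,000 to $90,000 per-unit estimates (market figures, not xAI's negotiated pricing):

```python
# Reproducing the article's cost range for phase one of Colossus.
# Unit prices are public market estimates, not xAI's actual pricing.

gpu_count = 100_000
for unit_price in (20_000, 90_000):   # estimated $ per H100
    total_billions = gpu_count * unit_price / 1e9
    print(f"${unit_price:,}/GPU -> ${total_billions:.0f}B for the H100s alone")
# Excludes the datacenter build-out, networking, power infrastructure,
# staffing, and the additional 50,000 H200s.
```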
[9]
Elon's 'Colossus' Supercomputer Built With 100K H100 NVIDIA GPUs Goes Online, H200 Upgrade Coming Soon
NVIDIA congratulates the xAI team for developing the most powerful NVIDIA-based AI training system, built with 100K H100 GPUs; an additional 50K H100 and 50K H200 accelerators are planned in an upgrade.

Elon Musk's venture xAI has finally completed development of the 'Colossus' supercomputer, which went online on Labor Day a few days ago. Musk said that Colossus is the 'most powerful AI training system in the world' and was completed in 122 days from start to finish. The Colossus supercomputer uses 100,000 NVIDIA H100 data center GPUs, making it the largest training cluster to use such a huge number of H100s. Elon also announced that in the upcoming months, Colossus will be upgraded with 50,000 more H200 GPUs, NVIDIA's flagship data center GPU on the Hopper architecture. The H200 is significantly more powerful than the H100, bringing almost 45% higher compute performance in specific generative AI and HPC workloads.

NVIDIA congratulated the xAI team for completing such a large project in just 4 months, adding: "Colossus is powered by @nvidia's #acceleratedcomputing platform, delivering breakthrough performance with exceptional gains in #energyefficiency."

The xAI Colossus project was started in June in Memphis and its training commenced in July. It will prepare Grok 3 for a December launch, replacing Grok 2 in delivering what xAI hopes will be the most powerful AI in the world. The Colossus supercomputer came after the end of xAI's deal with Oracle, which had been renting servers to the startup. The new supercluster is more powerful than what Oracle could provide and is set to double in performance in a few months with the addition of 50K more H200 GPUs.

The H200 brings 61GB more memory (141GB vs 80GB on the H100) and significantly higher memory bandwidth of 4.8TB/s, compared to 3.35TB/s on the H100. That said, with such a drastic change in specs, the H200 consumes 300W more power and will require liquid cooling, just as the H100s in Colossus utilize liquid cooling. At the moment, Colossus is the only supercomputer that has reached 100K NVIDIA GPUs, followed by Google AI with 90K GPUs and OpenAI, which uses 80K H100 GPUs. Meta AI and Microsoft AI are next with 70K and 60K GPUs, respectively.
[10]
Elon Musk's AI supercomputer is here
Elon Musk launched his artificial intelligence startup xAI last July, and it now has "the most powerful AI training system in the world." The training cluster, called Colossus, is powered by 100,000 Nvidia H100 graphics processing units, or GPUs, and is expected to double in size to 200,000 chips, including 50,000 of Nvidia's more powerful H200 chips, "in a few months," Musk said.

While AI rivals including OpenAI and Meta also have hundreds of thousands of Nvidia's chips, Colossus, which was brought online in about four months, has the most processors of any individual AI computing cluster in the world. "Excellent work by the team, Nvidia and our many partners/suppliers," Musk said on his social media platform, X.

xAI raised $6 billion in a Series B funding round in May, which included heavyweight investors such as Andreessen Horowitz and Sequoia Capital. The round pushed xAI's valuation to $24 billion.

Meanwhile, xAI is facing blame from local advocates in Memphis for worsening pollution through its use of gas-powered turbines, as smog in the city exceeds national air quality standards. In a letter to the Shelby County Health Department in August, the Southern Environmental Law Center said xAI's supercomputer "requires an enormous amount of electricity," and that the startup "has installed at least 18 gas combustion turbines over the last several months" to meet that demand, with more possibly coming. The startup "apparently has not applied" for the air permits required before installation and operation of some of the turbines, the SELC said. The environmental nonprofit is asking the health department to confirm whether xAI is operating its turbines without an air permit, and to order the startup to stop operating until it obtains one.
Elon Musk's xAI has launched Colossus, a groundbreaking AI training system utilizing 100,000 NVIDIA H100 GPUs. This massive computational power aims to revolutionize AI development and compete with industry giants.
In a groundbreaking development, Elon Musk's artificial intelligence company, xAI, has unveiled Colossus, touted as the world's most powerful AI training system. This massive computational powerhouse harnesses the strength of 100,000 NVIDIA H100 GPUs, setting a new benchmark in the AI industry [1].
Colossus represents a significant leap in AI training capabilities. Its 100,000 NVIDIA H100 GPUs are reported to provide around 20 exaFLOPS of AI performance [2]. This immense processing power is expected to accelerate AI model training and enable the development of more sophisticated AI systems.
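Headline exaFLOPS figures like this depend heavily on the precision, sparsity, and utilization assumed, so here is a hedged sketch of how such aggregates are typically derived. The per-GPU peaks are NVIDIA's published dense spec-sheet numbers for the SXM H100, and the utilization figure is purely an assumption.

```python
# How aggregate "AI exaFLOPS" headlines are usually computed:
# GPU count x per-GPU peak throughput at a chosen precision.
# Per-GPU peaks are NVIDIA's published dense figures for the SXM H100.

gpu_count = 100_000
peak_tflops = {"FP64": 34, "FP16/BF16": 989, "FP8": 1_979}

for precision, tflops in peak_tflops.items():
    print(f"{precision}: ~{gpu_count * tflops / 1e6:.1f} peak exaFLOPS")

# Sustained training throughput is far lower than peak; at an assumed
# 20% utilization of FP16 peak, the cluster would deliver ~20 exaFLOPS,
# one plausible reading of the figure cited above.
utilization = 0.20
print(f"FP16 @ {utilization:.0%}: ~{gpu_count * 989 * utilization / 1e6:.0f} exaFLOPS")
```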
Musk has already outlined plans to double Colossus's capacity. The expansion will involve adding 50,000 H100 GPUs and 50,000 of NVIDIA's more powerful H200 GPUs, bringing the total to an impressive 200,000 GPUs [5]. This upgrade is anticipated to further enhance the system's capabilities and maintain its competitive edge.
Colossus is currently being used to train xAI's large language model, Grok [3]. The system's immense computational power is expected to significantly improve Grok's performance and capabilities, potentially rivaling or surpassing other leading AI models on the market.
The introduction of Colossus positions xAI as a formidable competitor in the AI landscape. With its unprecedented scale, the system challenges the dominance of tech giants like Google, Microsoft, and OpenAI in large-scale AI training [4].
While Colossus represents a significant technological achievement, it also raises questions about energy consumption and environmental impact: the Memphis facility is reported to draw on the order of 150 megawatts of power and to be allocated up to one million gallons of cooling water a day, and the cost of operating and maintaining such a system is considerable.
The launch of Colossus marks a pivotal moment in AI development. Its immense processing power could accelerate breakthroughs in various AI applications, from natural language processing to complex problem-solving. As XAI continues to expand and refine this system, the AI community eagerly anticipates the potential advancements and innovations that may emerge from this technological marvel.