Curated by THEOUTPOST
On Sat, 28 Dec, 12:02 AM UTC
2 Sources
[1]
US' AI Hardware Restrictions on China Have Backfired
"The risk of an asteroid hitting the Earth or a pandemic also exists. But the risk of China destroying our system is significantly larger in my opinion," VC Vinod Khosla said. Chinese research firm DeepSeek on Thursday unveiled DeepSeek-V3, the strongest open-source model out there. While Chinese models have caught up with frontier models from the West over the last few months, DeepSeek paints a different picture this time. The company was able to train the model with just around $5.5 million, a cost that is significantly lower than many other models in this segment. Over the last few years, the United States has been imposing several embargos and export sanctions on NVIDIA GPUs to China. Given DeepSeek-V3's performance results and cost efficiency, these sanctions seem to have had a counterproductive effect. This has pushed Chinese engineers to focus on building models with unprecedented efficiency, considering the few resources that they have. DeepSeek-V3 is a large, 671 billion parameter model trained on 2.788 million NVIDIA H800 GPU hours. The model outperforms Meta's 405 billion parameter Llama 3.1 in most benchmarks and even closed source Claude 3.5 Sonnet and GPT-4o in several tests. This cost DeepSeek a total of $5.576 million, which includes pre-training, context extension, and post-training. Earlier this year, research institute EpochAI released a technical paper which revealed the staggering costs of training frontier models. "We find that the most expensive publicly-announced training runs to date are OpenAI's GPT-4 at $40 million and Google's Gemini Ultra at $30 million," read the report. DeepSeek is also an incredibly cost-effective model for API usage. It is currently priced at $0.14 per million tokens during input and $0.28 per million tokens for output until February 8, 2025. Eventually, it will cost $0.27 per million tokens during input and $1.10 per million tokens during output. OpenAI's GPT-4o costs $2.50 per million tokens for input and $10.00 per million tokens for output. "To run DeepSeek v3 24/7 at 60 tokens per second (5x human reading speed) is $2 a day," said Emad Mostaque, founder of Stability AI, who compared it to being as cheap as a cup of latte. DeepSeek-V3's technical paper reveals all the magic inside the model's architecture. Techniques like FP8 precision training, optimisation in the infrastructure algorithms, and the training framework are what make the model achieve it all, along with the fact that it is open source. The model is available on the web for free and also supports real-time information through web search. In a recent interview, Elon Musk, CEO of xAI, said that training the Grok 2 model took about 20,000 NVIDIA H100 GPUs. He added that training the Grok 3 models will require 1 lakh NVIDIA H100 GPUs. Meta also revealed that it is using more than 1 lakh NVIDIA H100 GPUs to train the upcoming Llama 4 models. "[This is] bigger than anything that I've seen reported for what others are doing," said Meta chief Mark Zuckerberg in the company's earnings report released in October. In contrast, DeepSeek-V3 was trained on 2,048 NVIDIA H800 GPUs. Owing to US President Joe Biden's administration restrictions, the NVIDIA H800 is a GPU designed to comply with export regulations in the Chinese market with a data transfer rate slashed by 50%. The H100 offers a transfer rate of 600 gigabytes per second, compared to the H800's 300 gigabytes per second. This does raise concerns about whether the frontier model makers are underutilising compute. 
"For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs," Andrej Karpathy, former OpenAI researcher, said in a post on X. "You have to ensure that you're not wasteful with what you have, and this (DeepSeek-V3) looks like a nice demonstration that there's still a lot to get through with both data and algorithms," he added. Soon after, the US also banned the export of NVIDIA's H800 to China and prevented the company from selling chips even with a reduced transfer rate. While there is no official disclosure of the number of H800 GPUs exported to China, an investigation suggested that there is an underground network of around 70 sellers who claim to receive dozens of GPUs every month. Another report also revealed that NVIDIA's chips are reaching China as a part of server products from Dell, Supermicro, etc. Recently, the US Department of Commerce asked NVIDIA to investigate how its produce has reached China. While it isn't clear whether DeepSeek purchased NVIDIA's H800s while it was being legally exported, their work is everything that the US government did not wish to see. The difficulty of purchasing powerful hardware has led China to intensely prioritise its focus on optimisations at the model architecture level. Amjad Masad, CEO of AI-enabled coding platform Replit, said on X, "The Chinese [have] innovated a way to train large models for cheap. Regulators never consider second-order effects." Most of the techniques outlined in the paper indicate that the researchers at DeepSeek mostly focus on problems that LLMs face under resource constraints. Bojan Tunguz, a former engineer at NVIDIA, said on X, "All the export bans on high-end semiconductors might have actually been counterproductive in the 'worst' way imaginable." Several social media users also speculate what would occur if the restrictions weren't present in the first place. If not for the chip embargo, China would have built AGI in months, said a user on X. DeepSeek doesn't wish to stop here either. "We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length," the researchers said in the report. "Additionally, we will try to break through the architectural limitations of a transformer, thereby pushing the boundaries of its modelling capabilities," they added. That said, fears of China using the best of technology stems from concerns about how China might use it for military purposes. The US government states that China will use "advanced computing chips" to produce weapons of mass destruction. "The PRC has poured resources into developing supercomputing capabilities and seeks to become a world leader in artificial intelligence by 2030. It is using these capabilities to monitor, track, and surveil its own citizens and fuel its military modernisation," said Thea D Rozman Kendler, assistant secretary of commerce for export administration. This sentiment is also echoed by leaders in the private sector. Vinod Khosla, a venture capital who has actively backed OpenAI, said in a essay titled 'AI: Utopia or Dystopia', said, "China is the fastest way [to making] the doomers' nightmares come true." "We may have to worry about sentient AI destroying humanity, but the risk of an asteroid hitting the Earth or a pandemic also exists. 
But the risk of China destroying our system is significantly larger in my opinion," Khosla said, referring to China as a "bad actor".

While DeepSeek's recent development may give the US government sleepless nights, the reality may not be as fearsome as it is made out to be. The US government may well be framing China's economic rise as a broader threat, as elaborated in a report published by the Carnegie Endowment for International Peace titled 'US-China Relations for the 2030s: Toward a Realistic Scenario for Coexistence'. "It [US] is uncomfortable with the possibility of a true peer competitor rising and views this as a threat. China, which has been rising for decades, reached some key landmarks recently; it became the world's top manufacturing and trading nation, as well as the world's second-most capable military power," read the report.
[2]
Chinese AI company's model breakthrough highlights limits of US sanctions
DeepSeek, a Chinese AI startup, says it has trained an AI model comparable to the leading models from heavyweights like Meta and Anthropic, but at an 11X reduction in the amount of GPU computing, and thus cost, required. The startling announcement suggests that while US sanctions have impacted the availability of AI hardware in China, clever scientists are working to extract the utmost performance from limited amounts of hardware. These types of advances could ultimately reduce the impact of choking off China's supply of AI chips.

DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster containing 2,048 Nvidia H800 GPUs in just two months, or about 2.8 million GPU hours, according to its paper. For comparison, it took Meta 11 times more compute power (30.8 million GPU hours) to train its Llama 3 with 405 billion parameters using a cluster containing 16,384 H100 GPUs over the course of 54 days.

DeepSeek claims it has significantly reduced the compute and memory demands typically required for models of this scale using advanced pipeline algorithms, an optimized communication framework, and FP8 low-precision computation as well as communication. The company used a cluster of 2,048 Nvidia H800 GPUs, with NVLink interconnects for GPU-to-GPU communication and InfiniBand interconnects for node-to-node communication. In such setups, inter-GPU communications are rather fast, but inter-node communications are not, so optimizations are key to performance and efficiency. While DeepSeek implemented numerous optimization techniques to reduce the compute requirements of DeepSeek-V3, several key technologies enabled its impressive results.

DeepSeek used the DualPipe algorithm to overlap computation and communication phases within and across forward and backward micro-batches, thereby reducing pipeline inefficiencies. In particular, dispatch (routing tokens to experts) and combine (aggregating results) operations were handled in parallel with computation using customized PTX (Parallel Thread Execution) instructions, which means writing low-level, specialized code that interfaces with Nvidia CUDA GPUs and optimizes their operations. The DualPipe algorithm minimized training bottlenecks, particularly for the cross-node expert parallelism required by the MoE architecture, and this optimization allowed the cluster to process 14.8 trillion tokens during pre-training with near-zero communication overhead, according to DeepSeek. In addition to implementing DualPipe, DeepSeek restricted each token to a maximum of four nodes to limit the number of nodes involved in communication. This reduced traffic and ensured that communication and computation could overlap effectively.

A critical element in reducing compute and communication requirements was the adoption of low-precision training techniques. DeepSeek employed an FP8 mixed-precision framework, enabling faster computation and reduced memory usage without compromising numerical stability. Key operations, such as matrix multiplications, were conducted in FP8, while sensitive components like embeddings and normalization layers retained higher precision (BF16 or FP32) to ensure accuracy. This approach reduced memory requirements while maintaining robust accuracy, with the relative training loss error consistently under 0.25%.
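To make the precision split concrete, here is a toy sketch, not DeepSeek's actual framework: it merely simulates routing the matrix multiplications through FP8 (via a quantize-and-dequantize round trip) while keeping embeddings and normalization in higher precision. Production FP8 training uses scaled FP8 GEMM kernels rather than this round-trip cast, and the float8 dtype below requires a recent PyTorch release:

```python
import torch
import torch.nn as nn

# Toy illustration of an FP8 / high-precision split (a sketch under assumptions,
# not DeepSeek's training framework). Matrix multiplications see FP8-quantized
# inputs, while the embedding and normalization layers stay in full precision.

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Quantize to float8 (e4m3) and back, mimicking the precision loss of FP8 GEMMs."""
    return x.to(torch.float8_e4m3fn).to(x.dtype)

class ToyBlock(nn.Module):
    def __init__(self, d_model: int = 256, vocab: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)   # sensitive: kept in high precision
        self.norm = nn.LayerNorm(d_model)           # sensitive: kept in high precision
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.norm(self.embed(tokens))
        # "FP8" matmul: quantize activations and weights before the GEMM
        return fp8_roundtrip(h) @ fp8_roundtrip(self.proj.weight).t()

model = ToyBlock()
tokens = torch.randint(0, 1000, (4, 16))
with torch.no_grad():
    out = model(tokens)
print(out.shape, out.dtype)  # torch.Size([4, 16, 256]) torch.float32
```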
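Similarly, the node-limited routing constraint mentioned above (each token restricted to at most four nodes) can be sketched in a few lines. The expert and node counts, and the rule used to score nodes, are assumptions chosen for this example rather than the exact procedure in DeepSeek's paper:

```python
import numpy as np

# Illustrative sketch of node-limited expert routing: a token may only use
# experts that live on at most MAX_NODES nodes, which caps the cross-node
# traffic that the dispatch step can generate.

NUM_EXPERTS = 64                              # hypothetical expert count
NUM_NODES = 8                                 # hypothetical node count
EXPERTS_PER_NODE = NUM_EXPERTS // NUM_NODES
TOP_K = 8                                     # experts activated per token
MAX_NODES = 4                                 # cap on nodes involved per token

def route_token(affinity: np.ndarray) -> list[int]:
    """Pick TOP_K experts for one token, restricted to MAX_NODES nodes."""
    node_of = np.arange(NUM_EXPERTS) // EXPERTS_PER_NODE
    # Score each node by its best expert affinity and keep the MAX_NODES best nodes.
    node_scores = [affinity[node_of == n].max() for n in range(NUM_NODES)]
    allowed_nodes = set(np.argsort(node_scores)[-MAX_NODES:])
    # Mask out experts on disallowed nodes, then take the top-k of what remains.
    masked = np.where(np.isin(node_of, list(allowed_nodes)), affinity, -np.inf)
    return np.argsort(masked)[-TOP_K:].tolist()

rng = np.random.default_rng(0)
affinity = rng.random(NUM_EXPERTS)            # token-to-expert affinity scores
experts = route_token(affinity)
nodes_used = {e // EXPERTS_PER_NODE for e in experts}
print(experts, "-> nodes used:", nodes_used)  # at most MAX_NODES distinct nodes
```

Capping the node count bounds the InfiniBand traffic generated by dispatch, which is what allows communication to overlap with computation as described above.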
When it comes to performance, the company says the DeepSeek-V3 MoE language model is comparable to or better than GPT-4x, Claude-3.5-Sonnet, and Llama-3.1, depending on the benchmark. Naturally, we'll have to see that proven with third-party benchmarks. The company has open-sourced the model and weights, so we can expect testing to emerge soon.

While DeepSeek-V3 may be behind frontier models like GPT-4o or o3 in terms of the number of parameters or reasoning capabilities, DeepSeek's achievements indicate that it is possible to train an advanced MoE language model using relatively limited resources. Of course, this requires a lot of optimizations and low-level programming, but the results appear to be surprisingly good.

The DeepSeek team recognizes that deploying the DeepSeek-V3 model requires advanced hardware as well as a deployment strategy that separates the prefilling and decoding stages, which might be unachievable for small companies due to a lack of resources. "While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment," the company's paper reads. "Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware."
Chinese AI company DeepSeek unveils a highly efficient large language model, DeepSeek-V3, trained at a fraction of the cost of Western counterparts, raising questions about the effectiveness of US chip export restrictions.
DeepSeek, a Chinese AI startup, has introduced DeepSeek-V3, a large language model that challenges the effectiveness of US chip export restrictions. This 671 billion parameter model demonstrates remarkable efficiency, having been trained at a fraction of the cost typically associated with comparable models from Western tech giants [1].
DeepSeek-V3 reportedly outperforms Meta's 405 billion parameter Llama 3.1 in most benchmarks and even surpasses closed-source models like Claude 3.5 Sonnet and GPT-4o in several tests. The company achieved this feat with roughly $5.6 million in training costs, significantly lower than the estimated $30-40 million spent on models like GPT-4 and Google's Gemini Ultra [1].
The model's efficiency stems from several key innovations, including FP8 mixed-precision training, the DualPipe algorithm for overlapping computation and communication, and an optimized cross-node communication framework [2].
DeepSeek-V3 was trained on 2,048 NVIDIA H800 GPUs, which were designed for the Chinese market with reduced data transfer rates to comply with US export regulations. This achievement raises questions about the effectiveness of US chip export restrictions, as Chinese engineers have been pushed to focus on building models with unprecedented efficiency given their limited resources [1].
The AI community has expressed surprise at DeepSeek's accomplishment. Andrej Karpathy, a former OpenAI researcher, noted that this level of capability was previously thought to require much larger GPU clusters [1]. Amjad Masad, CEO of Replit, suggested that regulators may not have considered the second-order effects of their restrictions [1].
While DeepSeek-V3 represents a significant advancement, the company acknowledges some limitations, particularly in deployment. The model requires advanced hardware and a specific deployment strategy, which may be challenging for smaller companies with limited resources [2].
DeepSeek plans to continue refining its model architectures, aiming to further improve both training and inference efficiency. This ongoing research could potentially lead to even more cost-effective and powerful AI models in the future [1].