Curated by THEOUTPOST
On Fri, 31 Jan, 8:05 AM UTC
4 Sources
[1]
DeepSeek's R1 and OpenAI's Deep Research just redefined AI -- RAG, distillation, and custom models will never be the same
Things are moving quickly in AI -- and if you're not keeping up, you're falling behind. Two recent developments are reshaping the landscape for developers and enterprises alike: DeepSeek's R1 model release and OpenAI's new Deep Research product. Together, they're redefining the cost and accessibility of powerful reasoning models, which has been well reported. Less talked about, however, is how they'll push companies to use techniques like distillation, supervised fine-tuning (SFT), reinforcement learning (RL), and retrieval-augmented generation (RAG) to build smarter, more specialized AI applications. As the initial excitement around DeepSeek's achievements settles, developers and enterprise decision-makers need to consider what it means for them. From pricing and performance to hallucination risks and the importance of clean data, here's what these breakthroughs mean for anyone building AI today.

Cheaper, transparent, industry-leading reasoning models -- but through distillation

The headline with DeepSeek-R1 is simple: it delivers an industry-leading reasoning model at a fraction of the cost of OpenAI's o1. Specifically, it's about 30 times cheaper to run, and unlike many closed models, DeepSeek offers full transparency around its reasoning steps. For developers, this means you can now build highly customized AI models without breaking the bank -- whether through distillation, fine-tuning, or simple RAG implementations. Distillation, in particular, is emerging as a powerful tool. By using DeepSeek-R1 as a "teacher model," companies can create smaller, task-specific models that inherit R1's superior reasoning capabilities. These smaller models, in fact, are the future for most enterprise companies.
The full R1 reasoning model can be too much for what companies need -- thinking too long, and not taking the decisive action companies need for their specific domain applications. "One of the things that no one is really talking about, certainly in the mainstream media, is that actually the reasoning models are not working that well for things like agents," said Sam Witteveen, an ML developer who works on AI agents, which are increasingly orchestrating enterprise applications. As part of its release, DeepSeek distilled its own reasoning capabilities onto a number of smaller models, including open-source models from Meta's Llama family and Alibaba's Qwen family, as described in its paper. It's these smaller models that can then be optimized for specific tasks. This trend toward smaller, faster models serving custom-built needs will accelerate: there will be armies of them. "We are starting to move into a world now where people are using multiple models. They're not just using one model all the time," said Witteveen. And this includes the low-cost, smaller closed-source models from Google and OpenAI as well. "Meaning that models like Gemini Flash, GPT-4o Mini, and these really cheap models actually work really well for 80% of use cases," he said.

If you work in an obscure domain, and have resources: use SFT...

After the distillation step, enterprise companies have a few options to make sure the model is ready for their specific application. If you're a company in a very specific domain, where details around the domain are not on the web or in books -- where LLMs could have trained on them -- you can inject your own domain-specific datasets, in a process called supervised fine-tuning (SFT). One example would be the ship container-building industry, where specifications, protocols and regulations are not widely available. DeepSeek showed that you can do this well with "thousands" of question-answer datasets.
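The mechanics behind this teacher-student setup can be sketched in a few lines. The example below is a toy illustration of the classic distillation loss, not DeepSeek's actual recipe (per its paper, DeepSeek's distilled models were fine-tuned on reasoning traces generated by R1): the student is trained to minimize the divergence between its output distribution and the teacher's "soft" distribution over answers.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between teacher and student distributions.

    Minimizing this transfers the teacher's 'soft' preferences over
    outputs to the student, not just the single hard label.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)

# A student that matches the teacher exactly incurs zero loss...
teacher = [2.0, 1.0, 0.1]
assert abs(distillation_loss(teacher, teacher)) < 1e-9
# ...while a student that disagrees incurs a positive loss.
assert distillation_loss(teacher, [0.1, 1.0, 2.0]) > 0
```

In practice the teacher and student are full language models and the loss is computed per token, but the objective has the same shape.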
For an example of how others can put this into practice, Chris Hay, an IBM engineer, demonstrated how he fine-tuned a small model using his own math-specific datasets to achieve lightning-fast responses -- outperforming OpenAI's o1 on the same tasks. (See his hands-on video here.)

...and a little RL

Additionally, companies wanting to train a model with additional alignment to specific preferences -- for example, making a customer support chatbot sound empathetic while being concise -- will want to do some reinforcement learning (RL) on the model. This is also useful if a company wants its chatbot to adapt its tone and recommendations based on a user's feedback. As every model gets good at everything, "personality" is going to be increasingly important, said Wharton AI professor Ethan Mollick on X yesterday. These SFT and RL steps can be tricky for companies to implement well, however. Feed the model data from one specific domain area, or tune it to act a certain way, and it can suddenly become useless for tasks outside of that domain or style.

For most companies, RAG will be good enough

For most companies, however, retrieval-augmented generation (RAG) is the easiest and safest path forward. RAG is a relatively straightforward process that allows organizations to ground their models with proprietary data contained in their own databases -- ensuring outputs are accurate and domain-specific. Here, an LLM feeds a user's prompt into vector and graph databases to retrieve information relevant to that prompt. RAG processes have gotten very good at finding only the most relevant content. This approach also helps counteract some of the hallucination issues associated with DeepSeek, which currently hallucinates 14% of the time compared to 8% for OpenAI's o3 model, according to a study by Vectara, a vendor that helps companies with the RAG process. This distillation of models plus RAG is where the magic will come for most companies.
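The retrieval half of that pipeline can be sketched with a toy example. Real deployments use a learned embedding model and a vector database; the bag-of-words stand-in below (all document text and names are illustrative) just shows the shape of the flow: embed the query, rank stored documents by similarity, and prepend the best match to the prompt the LLM sees.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector.
    Real RAG pipelines use a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 5 to 7 business days within the US.",
]
context = retrieve("when must a refund request be filed", docs)[0]
# The retrieved text grounds the model's answer in company data.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
assert "Refund" in context
```

Swapping the toy `embed` for a real embedding model and `docs` for a vector-database query is essentially all that separates this sketch from a production RAG loop.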
It has become incredibly easy to do, even for those with limited data science or coding expertise. I personally downloaded the DeepSeek distilled 1.5B Qwen model, the smallest one, so that it could fit nicely on my MacBook Air. I then loaded some PDFs of job applicant resumes into a vector database, and asked the model to look over the applicants and tell me which ones were qualified to work at VentureBeat. (In all, this took me 74 lines of code, which I basically borrowed from others doing the same.) I loved that the DeepSeek distilled model showed its thinking process behind why it did or didn't recommend each applicant -- a transparency that I wouldn't have gotten easily before DeepSeek's release. In my recent video discussion on DeepSeek and RAG, I walked through how simple it has become to implement RAG in practical applications, even for non-experts. Sam Witteveen also contributed to the discussion by breaking down how RAG pipelines work and why enterprises are increasingly relying on them instead of fully fine-tuning models. (Watch it here.)

OpenAI Deep Research: extending RAG's capabilities -- but with caveats

While DeepSeek is making reasoning models cheaper and more transparent, OpenAI's Deep Research, announced Sunday, represents a different but complementary shift. It can take RAG to a new level by crawling the web to create highly customized research. The output of this research can then be inserted as input into the RAG documents companies use, alongside their own data. This functionality, often referred to as agentic RAG, allows AI systems to autonomously seek out the best context from across the internet, bringing a new dimension to knowledge retrieval and grounding. OpenAI's Deep Research is similar to tools like Google's Deep Research, Perplexity and You.com, but OpenAI tried to differentiate its offering by suggesting its superior chain-of-thought reasoning makes it more accurate.
This is how these tools work: a company researcher asks the LLM to gather all the information available about a topic into a well-researched and cited report. The LLM responds by asking the researcher to answer 20 or so sub-questions to confirm what is wanted. The research LLM then goes out and performs 10 or 20 web searches to get the most relevant data to answer all those sub-questions, then extracts the knowledge and presents it in a useful way. However, this innovation isn't without its challenges. Amr Awadallah, the CEO of Vectara, cautioned about the risks of relying too heavily on outputs from models like Deep Research. He questions whether it is indeed more accurate: "It's not clear that this is true," Awadallah noted. "We're seeing articles and posts in various forums saying no, they're getting lots of hallucinations still, and Deep Research is only about as good as other solutions out there on the market." In other words, while Deep Research offers promising capabilities, enterprises need to tread carefully when integrating its outputs into their knowledge bases. The grounding knowledge for a model should come from verified, human-approved sources to avoid cascading errors, Awadallah said.

The cost curve is crashing: why this matters

The most immediate impact of DeepSeek's release is its aggressive price reduction. The tech industry expected costs to come down over time, but few anticipated just how quickly it would happen. DeepSeek has proven that powerful, open models can be both affordable and efficient, creating opportunities for widespread experimentation and cost-effective deployment. Awadallah emphasized this point, noting that the real game-changer isn't just the training cost -- it's the inference cost, which for DeepSeek is about 1/30th of OpenAI's o1 or o3 per token.
"The margins that OpenAI, Anthropic, and Google Gemini were able to capture will now have to be squished by at least 90%, because they can't stay competitive with such high pricing," Awadallah said. Not only that, but those costs will continue to go down. Dario Amodei, CEO of Anthropic, said recently that the cost of developing models continues to drop at around a 4x rate each year. It follows that the rates LLM providers charge to use them will continue to drop as well. "I fully expect the cost to go to zero," said Ashok Srivastava, chief data officer of Intuit, a company that has been driving AI hard in its tax and accounting software offerings like TurboTax and QuickBooks, "...and the latency to go to zero. They're just going to be commodity capabilities that we will be able to use." This cost reduction isn't just a win for developers and enterprise users; it's a signal that AI innovation is no longer confined to big labs with billion-dollar budgets. The barriers to entry have dropped, and that's inspiring smaller companies and individual developers to experiment in ways that were previously unthinkable. The models are so accessible that any business professional will be using them, not just AI experts, said Srivastava.

DeepSeek's disruption: challenging "Big AI's" stronghold on model development

Most importantly, DeepSeek has shattered the myth that only major AI labs can innovate. For years, companies like OpenAI and Google positioned themselves as the gatekeepers of advanced AI, spreading the belief that only top-tier PhDs with vast resources could build competitive models. DeepSeek has flipped that narrative. By making reasoning models open and affordable, it has empowered a new wave of developers and enterprise companies to experiment and innovate without needing billions in funding.
This democratization is particularly significant in the post-training stages -- like RL and fine-tuning -- where the most exciting developments are happening. DeepSeek exposed a fallacy that had emerged in AI: that only the big AI labs and companies could really innovate. This fallacy had forced a lot of other AI builders to the sidelines. DeepSeek has put a stop to that. It has given everyone inspiration that there are plenty of ways to innovate in this area.

The data imperative: why clean, curated data is the next action item for enterprise companies

While DeepSeek and Deep Research offer powerful tools, their effectiveness ultimately hinges on one critical factor: data quality. Getting your data in order has been a big theme for years, and it has accelerated over the past nine years of the AI era. But it has become even more important with generative AI, and now, with DeepSeek's disruption, it's absolutely key. Hilary Packer, CTO of American Express, underscored this in an interview with VentureBeat yesterday: "The aha moment for us, honestly, was the data. You can make the best model selection in the world... but the data is key. Validation and accuracy are the holy grail right now of generative AI." This is where enterprises must focus their efforts. While it's tempting to chase the latest models and techniques, the foundation of any successful AI application is clean, well-structured data. Whether you're using RAG, SFT, or RL, the quality of your data will determine the accuracy and reliability of your models. And while many companies aspire to perfect their entire data ecosystems, the reality is that perfection is elusive. Instead, businesses should focus on cleaning and curating the most critical portions of their data to enable point AI applications that deliver immediate value.
Related to this, a lot of questions linger around the exact data that DeepSeek used to train its models, and this raises questions about the inherent bias of the knowledge stored in its model weights. But that's no different from questions around other open-source models, such as Meta's Llama model series. Most enterprise users have found ways to fine-tune or ground the models with RAG enough to mitigate any problems around such biases. And that's been enough to create serious momentum within enterprise companies toward accepting open source, indeed even leading with open source. Similarly, there's no question that many companies will be using DeepSeek models, regardless of the fear around the fact that the company is from China. Though it's also true that a lot of companies in highly regulated industries such as finance or healthcare are going to be cautious about using any DeepSeek model in any application that interfaces directly with customers, at least in the short term.

Conclusion: the future of enterprise AI is open, affordable, and data-driven

DeepSeek and OpenAI's Deep Research are more than just new tools in the AI arsenal -- they're signals of a profound shift, in which enterprises will be rolling out masses of purpose-built models that are extremely affordable, competent, and grounded in the company's own data and approach. For enterprises, the message is clear: the tools to build powerful, domain-specific AI applications are at your fingertips. You risk falling behind if you don't leverage these tools. But real success will come from how you curate your data, leverage techniques like RAG and distillation, and innovate beyond the pre-training phase. As AmEx's Packer put it, the companies that get their data right will be the ones leading the next wave of AI innovation.
[2]
Remember DeepSeek? Two New AI Models Say They're Even Better - Decrypt
AI companies used to measure themselves against industry leader OpenAI. No more. Now that China's DeepSeek has emerged as the frontrunner, it's become the one to beat. On Monday, DeepSeek turned the AI industry on its head, causing billions of dollars in losses on Wall Street while raising questions about how efficient some U.S. startups -- and venture capital -- actually are. Now, two new AI powerhouses have entered the ring: the Allen Institute for AI in Seattle and Alibaba in China. Both claim their models are on a par with or better than DeepSeek V3. The Allen Institute for AI, a U.S.-based research organization known for the release of a more modest vision model named Molmo, today unveiled a new version of Tülu 3, a free, open-source 405-billion-parameter large language model. "We are thrilled to announce the launch of Tülu 3 405B -- the first application of fully open post-training recipes to the largest open-weight models," the Paul Allen-funded non-profit said in a blog post. "With this release, we demonstrate the scalability and effectiveness of our post-training recipe applied at 405B parameter scale." For those who like comparing sizes, Meta's latest LLM, Llama-3.3, has 70 billion parameters, and its largest model to date is Llama-3.1 405B -- the same size as Tülu 3. The model was so big that it demanded extraordinary computational resources, requiring 32 nodes with 256 GPUs running in parallel for training. The Allen Institute hit several roadblocks while building its model. The sheer size of Tülu 3 meant the team had to split the workload across hundreds of specialized computer chips, with 240 chips handling the training process while 16 others managed real-time operations. Even with this massive computing power, the system frequently crashed and required round-the-clock supervision to keep it running.
Tülu 3's breakthrough centered on its novel Reinforcement Learning with Verifiable Rewards (RLVR) framework, which showed particular strength in mathematical reasoning tasks. Each RLVR iteration took approximately 35 minutes, with inference requiring 550 seconds, weight transfer 25 seconds, and training 1,500 seconds, with the AI getting better at problem-solving with each round. Reinforcement Learning with Verifiable Rewards works like a sophisticated tutoring system. The AI received specific tasks, like solving math problems, and got instant feedback on whether its answers were correct. However, unlike traditional AI training (like the approach used by OpenAI to train ChatGPT), where human feedback can be subjective, RLVR only rewarded the AI when it produced verifiably correct answers, similar to how a math teacher knows exactly when a student's solution is right or wrong. This is why the model is so good at math and logic problems but not the best at other tasks like creative writing, roleplay, or factual analysis. The model is available at Allen AI's playground, a free site with a UI similar to ChatGPT and other AI chatbots. Our tests confirmed what could be expected from a model this big. It is very good at solving problems and applying logic. We provided different random problems from a number of math and science benchmarks, and it was able to output good answers -- even easier to understand than the sample answers the benchmarks provided. However, it failed at other logical language-related tasks that didn't involve math, such as writing sentences that end in a specific word. Also, Tülu 3 isn't multimodal. Instead, it stuck to what it knew best -- churning out text. No fancy image generation or embedded chain-of-thought tricks here. On the upside, the interface is free to use, requiring a simple login, either via Allen AI's playground or by downloading the weights to run locally.
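The core of that reward scheme can be sketched in a few lines. The verifier below is an illustrative arithmetic checker, not Ai2's actual implementation, but it captures the defining property: the reward is binary and objective, granted only when an answer can be checked programmatically.

```python
def verifiable_reward(problem, model_answer):
    """Reward 1.0 only when the answer is verifiably correct.

    Unlike human preference scores, this signal is binary and
    objective, which is why RLVR-trained models excel at math and
    logic but not at tasks like creative writing, where no
    automatic verifier exists.
    """
    # Problem strings are trusted arithmetic expressions in this toy.
    expected = eval(problem)  # e.g. "3 * (4 + 5)" -> 27
    try:
        return 1.0 if int(model_answer.strip()) == expected else 0.0
    except ValueError:
        # Unparseable answers (chatty text, hedging) earn no reward.
        return 0.0

assert verifiable_reward("3 * (4 + 5)", "27") == 1.0
assert verifiable_reward("3 * (4 + 5)", "26") == 0.0
assert verifiable_reward("3 * (4 + 5)", "I think it's 27") == 0.0
```

In the full RLVR loop this reward drives a policy-gradient update on the model after each batch of attempts; the sketch shows only the scoring step.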
The model is available for download via Hugging Face, with alternatives ranging from 8 billion parameters to the gigantic 405-billion-parameter version. Meanwhile, China isn't resting on DeepSeek's laurels. Amid all the hubbub, Alibaba dropped Qwen 2.5-Max, a massive language model trained on over 20 trillion tokens. The Chinese tech giant released the model during the Lunar New Year, just days after DeepSeek R1 disrupted the market. Benchmark tests showed Qwen 2.5-Max outperformed DeepSeek V3 in several key areas, including coding, math, reasoning, and general knowledge, as evaluated using benchmarks like Arena-Hard, LiveBench, LiveCodeBench, and GPQA-Diamond. The model demonstrated competitive results against industry leaders like GPT-4o and Claude 3.5 Sonnet, according to the model's card. Alibaba made the model available through its cloud platform with an OpenAI-compatible API, allowing developers to integrate it using familiar tools and methods. The company's documentation showed detailed examples of implementation, suggesting a push for widespread adoption. But Alibaba's Qwen Chat web portal is the best option for general users, and it seems pretty impressive -- for those who are okay with creating an account there. It is probably the most versatile AI chatbot interface currently available. Qwen Chat allows users to generate text, code, and images flawlessly. It also supports web search functionality, artifacts, and even a very good video generator, all in the same UI -- for free. It also has a unique function in which users can choose two different models to "battle" against each other to provide the best response. Overall, Qwen's UI is more versatile than Allen AI's. In text responses, Qwen2.5-Max proved to be better than Tülu 3 at creative writing and reasoning tasks that involved language analysis. For example, it was capable of generating phrases ending in a specific word.
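What "OpenAI-compatible" means in practice is that the request follows the familiar chat-completions schema, so existing client code needs little more than a new base URL and API key. A minimal sketch of building such a request body -- the endpoint and model name below are placeholders for illustration, not Alibaba's real values; consult the official documentation for those:

```python
import json

# Hypothetical endpoint shown for illustration only.
BASE_URL = "https://example-cloud-provider.com/v1/chat/completions"

def build_chat_request(model, user_message):
    """Construct the JSON body of an OpenAI-style chat completion call.

    Any client that speaks this schema can target a compatible API
    by swapping the base URL and credentials -- which is why
    compatibility lowers the switching cost between providers.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }

body = build_chat_request("qwen-max", "Summarize RAG in one sentence.")
payload = json.dumps(body)  # what would be POSTed to BASE_URL
assert body["model"] == "qwen-max"
assert body["messages"][1]["role"] == "user"
```

The actual HTTP call (with an `Authorization` header carrying the API key) is omitted since it requires live credentials.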
Its video generator is a nice addition and is arguably on par with offerings like Kling or Luma Labs -- and definitely better than what Sora can make. Also, its image generator provides realistic and pleasant images, showing a clear advantage over OpenAI's DALL-E 3, but clearly behind top models like Flux or Midjourney. The triple release of DeepSeek, Qwen2.5-Max, and Tülu 3 just gave the open-source AI world its most significant boost in a while. DeepSeek had already turned heads by building its R1 reasoning model using earlier Qwen technology for distillation, proving open-source AI could match billion-dollar tech giants at a fraction of the cost. And now Qwen2.5-Max has upped the ante. If DeepSeek follows its established playbook -- leveraging Qwen's architecture -- its next reasoning model could pack an even bigger punch. Still, this could be a good opportunity for the Allen Institute. OpenAI is racing to launch its o3 reasoning model, which some industry analysts estimate could cost users up to $1,000 per query. If so, Tülu 3's arrival makes it a great open-source alternative -- especially for developers wary of building on Chinese technology due to security concerns or regulatory requirements.
[3]
What DeepSeek Means for Open-Source AI
You've likely heard of DeepSeek: the Chinese company released a pair of open large language models (LLMs), DeepSeek-V3 and DeepSeek-R1, in December 2024, making them available to anyone for free use and modification. Then, in January, the company released a free chatbot app, which quickly gained popularity and rose to the top spot in Apple's app store. The DeepSeek models' excellent performance, which rivals the best closed LLMs from OpenAI and Anthropic, spurred a stock market rout on 27 January that wiped out more than US $600 billion from leading AI stocks. Proponents of open AI models, however, have met DeepSeek's releases with enthusiasm. Over 700 models based on DeepSeek-V3 and R1 are now available on the AI community platform Hugging Face. Collectively, they've received over five million downloads. Cameron R. Wolfe, a senior research scientist at Netflix, says the enthusiasm is warranted. "DeepSeek-V3 and R1 legitimately come close to matching closed models. Plus, the fact that DeepSeek was able to make such a model under strict hardware limitations due to American export controls on Nvidia chips is impressive." It's that second point -- hardware limitations due to U.S. export restrictions in 2022 -- that highlights DeepSeek's most surprising claims. The company says the DeepSeek-V3 model cost roughly $5.6 million to train using Nvidia's H800 chips. The H800 is a less performant version of Nvidia hardware, designed to pass the standards set by the U.S. export ban, which was meant to stop Chinese companies from training top-tier LLMs. (The H800 chip itself was also banned in October 2023.) DeepSeek achieved impressive results on less capable hardware with a "DualPipe" parallelism algorithm designed to get around the Nvidia H800's limitations. It uses low-level programming to precisely control how training tasks are scheduled and batched.
The model also uses a "mixture-of-experts" (MoE) architecture, which includes many neural networks, the "experts," that can be activated independently. Because each expert is smaller and more specialized, less memory is required to train the model, and compute costs are lower once the model is deployed. The result is DeepSeek-V3, a large language model with 671 billion parameters. While OpenAI doesn't disclose the parameter counts of its cutting-edge models, they're speculated to exceed one trillion. Despite that, DeepSeek-V3 achieved benchmark scores that matched or beat OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet. And DeepSeek-V3 isn't the company's only star; it also released a reasoning model, DeepSeek-R1, with chain-of-thought reasoning like OpenAI's o1. While R1 isn't the first open reasoning model, it's more capable than prior ones, such as Alibaba's QwQ. As with DeepSeek-V3, it achieved its results with an unconventional approach. Most LLMs are trained with a process that includes supervised fine-tuning (SFT). This technique samples the model's responses to prompts, which are then reviewed and labeled by humans. Their evaluations are fed back into training to improve the model's responses. It works, but having humans review and label the responses is time-consuming and expensive. DeepSeek first tried ignoring SFT and instead relied on reinforcement learning (RL) to train DeepSeek-R1-Zero. A rules-based reward system, described in the model's whitepaper, was designed to help DeepSeek-R1-Zero learn to reason. But this approach led to issues, like language mixing (the use of many languages in a single response), that made its responses difficult to read. To get around that, DeepSeek-R1 used a "cold start" technique that begins with a small SFT dataset of just a few thousand examples. From there, RL is used to complete the training. Wolfe calls it a "huge discovery that's very non-trivial."
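The memory and compute savings of a mixture-of-experts design come from the routing step: a small gating function scores every expert for the current input, and only the top-scoring few actually run. A toy sketch, with plain Python functions standing in for the expert networks and a fixed gate (all numbers illustrative, not DeepSeek's actual router):

```python
def route(gate_fn, experts, hidden, k=2):
    """Top-k mixture-of-experts routing.

    The gate scores every expert for the current input; only the
    k best experts execute, so most of the network stays idle and
    both training memory and inference compute drop.
    """
    scores = [gate_fn(i, hidden) for i in range(len(experts))]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Output is the score-weighted sum of the active experts only.
    output = sum(scores[i] / total * experts[i](hidden) for i in top)
    return output, top

# Four tiny 'experts' (plain functions standing in for neural nets).
experts = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 1, lambda h: h * 3]
# A fixed gate for illustration; real gates are learned per token.
gate = lambda i, h: [0.1, 0.6, 0.05, 0.25][i]

output, active = route(gate, experts, hidden=10.0)
assert sorted(active) == [1, 3]  # only 2 of the 4 experts ran
```

With 2 of 4 experts active per input, roughly half the parameters sit idle on any given forward pass; production MoE models push that ratio much further, which is how a 671-billion-parameter model stays affordable to run.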
For Rajkiran Panuganti, senior director of generative AI applications at the Indian company Krutrim, DeepSeek's gains aren't just academic. Krutrim provides AI services for clients and has used several open models, including Meta's Llama family, to build its products and services. Panuganti says he'd "absolutely" recommend using DeepSeek in future projects. "The earlier Llama models were great open models, but they're not fit for complex problems. Sometimes they're not able to answer even simple questions, like how many times does the letter 'r' appear in strawberry," says Panuganti. He cautions that DeepSeek's models don't beat leading closed reasoning models, like OpenAI's o1, which may be preferable for the most challenging tasks. However, he says DeepSeek-R1 is "many multipliers" less expensive. And that's if you're paying DeepSeek's API fees. While the company has a commercial API that charges for access to its models, they're also free to download, use, and modify under a permissive license. Better still, DeepSeek offers several smaller, more efficient versions of its main models, known as "distilled models." These have fewer parameters, making them easier to run on less powerful devices. YouTuber Jeff Geerling has already demonstrated DeepSeek R1 running on a Raspberry Pi. Popular interfaces for running an LLM locally on one's own computer, like Ollama, already support DeepSeek R1. I had DeepSeek-R1-7B, the second-smallest distilled model, running on a Mac Mini M4 with 16 gigabytes of RAM in less than 10 minutes. While DeepSeek is "open," some details are left behind the wizard's curtain. DeepSeek doesn't disclose the datasets or training code used to train its models. This is a point of contention in open-source communities. Most "open" models provide only the model weights necessary to run or fine-tune the model. The full training dataset, as well as the code used in training, remains hidden.
Stefano Maffulli, director of the Open Source Initiative, has repeatedly called out Meta on social media, saying its decision to label its Llama model as open source is an "outrageous lie." DeepSeek's models are similarly opaque, but Hugging Face is trying to unravel the mystery. On 28 January, it announced Open-R1, an effort to create a fully open-source version of DeepSeek-R1. "Reinforcement learning is notoriously tricky, and small implementation differences can lead to major performance gaps," says Elie Bakouch, an AI research engineer at Hugging Face. The compute cost of regenerating DeepSeek's dataset, which is required to reproduce the models, will also prove significant. However, Bakouch says Hugging Face has a "science cluster" that should be up to the task. Researchers and engineers can follow Open-R1's progress on Hugging Face and GitHub. Regardless of Open-R1's success, however, Bakouch says DeepSeek's impact goes well beyond the open AI community. "The excitement isn't just in the open-source community, it's everywhere. Researchers, engineers, companies, and even non-technical people are paying attention," he says.
[4]
Ai2 releases Tülu 3, a fully open source model that bests DeepSeek v3, GPT-4o with novel post-training approach
The open-source model race just keeps getting more interesting. Today, the Allen Institute for AI (Ai2) debuted its latest entrant in the race with the launch of its open-source Tülu 3 405B-parameter large language model (LLM). The new model not only matches OpenAI's GPT-4o's capabilities but also surpasses DeepSeek's V3 model across critical benchmarks. This isn't the first time Ai2 has made bold claims about a new model. In Nov. 2024, the company released its first version of Tülu 3, which had both 8- and 70-billion-parameter versions. At the time, Ai2 claimed the model was up to par with the latest GPT-4 model from OpenAI, Anthropic's Claude and Google's Gemini. The big difference is that Tülu 3 is open source. Ai2 had also claimed back in Sept. 2024 that its Molmo models were able to beat GPT-4o and Claude on some benchmarks. While benchmark performance data is interesting, what's perhaps more useful is the training innovations that enable the new Ai2 model.

Pushing post-training to the limit

The big breakthrough for Tülu 3 405B is rooted in an innovation that first appeared with the initial Tülu 3 release in 2024. That release utilized a combination of advanced post-training techniques to get better performance. With the Tülu 3 405B model, those post-training techniques have been pushed even further, using an advanced post-training methodology that combines supervised fine-tuning, preference learning, and a novel reinforcement learning approach that has proven exceptional at larger scales.
"Applying Tülu 3's post-training recipes to Tülu 3-405B, our largest-scale, fully open-source post-trained model to date, levels the playing field by providing open fine-tuning recipes, data, and code, empowering developers and researchers to achieve performance comparable to top-tier closed models," Hannaneh Hajishirzi, senior director of NLP Research at Ai2, told VentureBeat.

Advancing the state of open-source AI post-training with RLVR

Post-training is something that other models, including DeepSeek V3, do as well. The key innovation that helps differentiate Tülu 3 is Ai2's Reinforcement Learning with Verifiable Rewards (RLVR) system. Unlike traditional training approaches, RLVR uses verifiable outcomes -- such as solving mathematical problems correctly -- to fine-tune the model's performance. This technique, when combined with Direct Preference Optimization (DPO) and carefully curated training data, has enabled the model to achieve better accuracy in complex reasoning tasks while maintaining strong safety characteristics. The RLVR system showed improved results at the 405B parameter scale compared to smaller models. The system also demonstrated particularly strong results in safety evaluations, outperforming DeepSeek V3, Llama 3.1 and Nous Hermes 3. Notably, the RLVR framework's effectiveness increased with model size, suggesting potential benefits from even larger-scale implementations.

How Tülu 3 405B compares to GPT-4o and DeepSeek V3

The model's competitive positioning is particularly noteworthy in the current AI landscape. Tülu 3 405B not only matches the capabilities of GPT-4o but also outperforms DeepSeek V3 in some areas, particularly on safety benchmarks. Across a suite of 10 AI benchmarks, including safety benchmarks, Ai2 reported that the Tülu 3 405B RLVR model had an average score of 80.7, surpassing DeepSeek V3's 75.9.
Tülu, however, is not quite as good as GPT-4o, which scored 81.6. Overall, the metrics suggest that Tülu 3 405B is at the very least extremely competitive with GPT-4o and DeepSeek V3 across the benchmarks.

Why open source AI matters and how Ai2 is doing it differently

What makes Tülu 3 405B different for users is how Ai2 has made the model available. There is a lot of noise in the AI market about open source. DeepSeek says it's open source, and so is Meta's Llama 3.1, which Tülu 3 405B also outperforms. With both DeepSeek and Llama, the models are freely available for use, and some, but not all, of the code is available. For example, DeepSeek has released R1's model code and pre-trained weights but not the training data.

Ai2 is taking a differentiated approach in an attempt to be more open. "We don't leverage any closed datasets," Hajishirzi said. "As with our first Tulu 3 release in November 2024, we are releasing all of the infrastructure code." She added that Ai2's fully open approach, which includes data, training code and models, ensures users can easily customize their pipeline for everything from data selection through evaluation.

Users can access the full suite of Tülu 3 models, including Tülu 3-405B, on Ai2's Tülu 3 page, or test Tülu 3-405B through Ai2's Playground demo space.
Recent developments in AI models from DeepSeek, Allen Institute, and Alibaba are reshaping the landscape of artificial intelligence, challenging industry leaders and pushing the boundaries of what's possible in language processing and reasoning capabilities.
DeepSeek, a Chinese AI company, has recently made waves in the artificial intelligence sector with the release of its open-source large language models (LLMs), DeepSeek-V3 and DeepSeek-R1 [1]. These models have demonstrated performance rivaling that of industry leaders like OpenAI and Anthropic, despite being developed under hardware limitations due to U.S. export controls [3].
The company's achievements are particularly noteworthy given the constraints they faced. DeepSeek claims to have trained their V3 model for approximately $5.5 million using Nvidia's H800 chips, which were designed to comply with U.S. export restrictions [3]. This feat was made possible through innovative techniques such as the "DualPipe" parallelism algorithm and a "mixture-of-experts" (MoE) architecture, allowing for efficient training and deployment [3].
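Top-k expert routing, the mechanism behind MoE's efficiency, can be sketched in a few lines. This is a generic toy illustration, not DeepSeek's actual architecture; the dimensions, gating scheme and random weights are made up:

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Toy mixture-of-experts layer: a gating network scores every
    expert, only the top-k highest-scoring experts run, and their
    outputs are mixed with renormalized gate weights."""
    logits = x @ gate_w                       # router scores, one per expert
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the selected experts
    # Only the selected experts do any computation; skipping the rest
    # is where MoE saves compute at large parameter counts.
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))
```

Because only k experts run per token, total parameter count can grow far faster than per-token compute, which is the efficiency argument for MoE at scale.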
In response to DeepSeek's breakthrough, the Allen Institute for AI has unveiled Tülu 3 405B, a 405-billion-parameter LLM that claims to match or surpass the capabilities of both DeepSeek V3 and OpenAI's GPT-4o [4]. Tülu 3's development faced significant challenges, requiring 32 nodes with 256 GPUs running in parallel for training [2].
The model's key innovation lies in its novel Reinforcement Learning with Verifiable Rewards (RLVR) framework, which has shown particular strength in mathematical reasoning tasks [2]. This approach, combined with other post-training techniques, has enabled Tülu 3 to achieve competitive results across various benchmarks [4].
Not to be outdone, Chinese tech giant Alibaba has introduced Qwen 2.5-Max, a massive language model trained on over 20 trillion tokens [2]. Benchmark tests indicate that Qwen 2.5-Max outperforms DeepSeek V3 in several key areas, including coding, math, reasoning, and general knowledge [2].
Alibaba has made Qwen 2.5-Max available through its cloud platform with an OpenAI-compatible API, facilitating easy integration for developers [2]. The company's Qwen Chat web portal offers a versatile interface for general users, supporting text, code, and image generation, as well as web search functionality [2].
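An OpenAI-compatible endpoint accepts the standard /chat/completions request shape, which is what makes integration straightforward for existing clients. The sketch below only builds the JSON request body; the model identifier is illustrative, so check Alibaba's documentation for real model IDs and the endpoint URL:

```python
import json

def chat_completion_body(model: str, user_message: str) -> str:
    """Build the JSON body of an OpenAI-compatible /chat/completions
    request. Any client that speaks this wire format can target the
    endpoint by swapping the base URL and model name."""
    payload = {
        "model": model,  # illustrative ID; see the provider's docs
        "messages": [
            {"role": "user", "content": user_message},
        ],
    }
    return json.dumps(payload)
```

The same body works against any provider that implements the OpenAI wire format, which is the practical payoff of API compatibility.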
The release of these powerful open-source models has significant implications for the AI community. Over 700 models based on DeepSeek-V3 and R1 are now available on the AI community platform HuggingFace, with over five million downloads collectively [3].
Cameron R. Wolfe, a senior research scientist at Netflix, notes that DeepSeek's models "legitimately come close to matching closed models," highlighting the potential for open-source AI to compete with proprietary solutions [3]. This democratization of AI technology could lead to increased innovation and accessibility in the field.
While these developments are promising, challenges remain. DeepSeek's models, for instance, have shown a higher rate of hallucination compared to some competitors [1]. Additionally, the "openness" of these models varies, with some companies not disclosing full training datasets or code [3].
As the AI model race continues to heat up, it's clear that open-source solutions are becoming increasingly competitive with their closed-source counterparts. This trend could reshape the AI landscape, potentially leading to more accessible and transparent AI technologies in the future.