2 Sources
[1]
AI can learn to show its workings through trial and error
When a student encounters a challenging mathematics problem or a programmer needs to write a complex algorithm, they will rarely solve it all in one go. Instead, they will reason through the task, jotting down notes and intermediate steps to arrive at a final solution. Likewise, large language models (LLMs) -- artificial intelligence (AI) systems that process and generate human language -- perform better at complex tasks when they write down their reasoning process before blurting out an answer than when they do not. In a paper in Nature, the DeepSeek AI team reports that LLMs can be incentivized to learn to reason without ever being shown examples of human reasoning trajectories, using a trial-and-error process called reinforcement learning.

So, what needs to be done to get an LLM to write out its reasoning process? Early efforts to elicit reasoning in LLMs simply added an extra instruction. Instead of prompting the LLM with "Q: Is 119 a prime number? A:" and expecting it to answer yes or no, researchers might input "Q: Is 119 prime? A: Let's think step by step." A small change in language was enough to induce the LLM to produce a step-by-step explanation -- called a reasoning trace -- before giving its answer (see the first sketch below). Other efforts taught LLMs to show their reasoning by presenting them with examples of humans using reasoning to solve problems. The LLM then learnt to produce reasoning traces that looked like the ones in the data -- this is called supervised learning. However, prompting or training the LLM using human inputs can introduce biases, and these approaches prevent the model from developing its own ways of reasoning, which might perform better than human examples.

The researchers introduced a paradigm for eliciting reasoning steps from LLMs that are separate from the production of an answer. They implemented this in a model called DeepSeek-R1, which was released in January 2025. Rather than hoping that the LLM would reason when it was instructed to do so, or guiding it using examples of the human reasoning process, the researchers used a type of algorithm called reinforcement learning.

Reinforcement-learning algorithms resemble how a child might learn to play a video game. As the child navigates their avatar through the game world, they learn through trial and error that some actions (such as collecting gold coins) earn points, whereas others (such as running into enemies) set their score back to zero. In a similar vein, DeepSeek-R1 was awarded a high score when it answered questions correctly and a low score when it gave wrong answers. The researchers realized that, because maths and programming questions typically have verifiable answers, they could create a scoring system that helped the LLM to improve during the training process (see the second sketch below).

The researchers' main discovery was that, when the LLM was trained to produce correct answers using the trial-and-error process of reinforcement learning, it naturally learnt to output its reasoning (Fig. 1). This contrasts with previous prompting-based approaches, which were more akin to expecting a child to learn to master a video game by having them read the instructions, or supervised-learning approaches, which can be likened to expecting the child to master a game by watching a sibling play it hundreds of times.
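To make the step-by-step prompting described above concrete, the first sketch shows zero-shot chain-of-thought prompting in Python. The helper function is hypothetical and illustrative only; the "Let's think step by step." suffix comes from the article, and the resulting string would be sent to any LLM of one's choosing.

    # Minimal sketch of zero-shot chain-of-thought prompting.
    # Appending "Let's think step by step." is the only change needed
    # to nudge an LLM into emitting a reasoning trace before its answer.

    def build_prompt(question: str, with_reasoning: bool) -> str:
        """Return a plain prompt or its step-by-step variant."""
        prompt = f"Q: {question}\nA:"
        if with_reasoning:
            prompt += " Let's think step by step."
        return prompt

    # Direct prompt: the model tends to answer immediately ("No.").
    print(build_prompt("Is 119 a prime number?", with_reasoning=False))
    # Step-by-step prompt: the model tends to reason first,
    # e.g. "119 = 7 x 17, so 119 is not prime", and then answer.
    print(build_prompt("Is 119 a prime number?", with_reasoning=True))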
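The scoring system for verifiable answers can likewise be sketched in a few lines. This second sketch is a generic illustration of a rule-based reward, not DeepSeek's actual implementation; the numeric tolerance, the exact-match fallback and the test-suite scoring are all assumptions.

    # Sketch of a rule-based reward over verifiable answers (an assumed
    # design, not DeepSeek's code). Maths answers are checked against a
    # ground truth; code is scored by the unit tests it passes. The
    # resulting score is the signal that reinforcement learning maximizes.

    def maths_reward(model_answer: str, ground_truth: str) -> float:
        """1.0 for a correct answer, 0.0 otherwise."""
        try:
            return float(abs(float(model_answer) - float(ground_truth)) < 1e-9)
        except ValueError:
            # Non-numeric answers fall back to an exact string match.
            return float(model_answer.strip() == ground_truth.strip())

    def code_reward(passed_tests: int, total_tests: int) -> float:
        """Fraction of unit tests that a generated program passes."""
        return passed_tests / total_tests if total_tests else 0.0

    print(maths_reward("17", "17.0"))  # 1.0 -- correct
    print(maths_reward("19", "17.0"))  # 0.0 -- wrong
    print(code_reward(8, 10))          # 0.8 -- partial credit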
Because it was trained using reinforcement learning, the LLM was not limited to learning human-defined reasoning patterns; it could also discover its own behaviours that earned high rewards. The researchers found that the LLM learnt to evaluate its own in-progress reasoning by reflecting on the statements it had already generated, and that it learnt to explore alternative approaches in its responses. As one example of this, the model learnt to insert phrases into its reasoning such as "Wait. That's an aha moment I can flag here."

However, the LLM also learnt certain behaviours that, although they might have helped it to produce better responses, resulted in reasoning traces that were difficult to understand. For example, the LLM adopted a behaviour in which its reasoning would switch back and forth between Chinese and English (the two languages the LLM was optimized to understand). The researchers also found that the LLM learnt to produce extremely long reasoning traces, which can contain 10,000 words or more. Furthermore, the reinforcement-learning method had to be trained on questions with clear-cut right or wrong answers (such as maths problems). This meant that the LLM didn't learn how to handle questions requiring nuanced, subjective or long-form responses.

The researchers show that many of these issues were resolved by using a multistage training framework, in which the LLM was exposed to alternating stages of reinforcement learning and supervised learning (see the sketch below). Trained in this way, DeepSeek-R1 achieved state-of-the-art accuracy on tasks that assessed maths and coding skills, factual knowledge and other forms of language understanding, in both Chinese and English.

Ultimately, the question of what makes a good reasoning LLM is a philosophical as much as a technical one. What behaviours do users want from an AI when they ask it hard questions? At one extreme, imagine an AI that has learnt to reason in a gibberish language that no human can hope to understand. Should we care that its reasoning is completely unintelligible, so long as it arrives at the correct answer? The version of DeepSeek-R1 that was trained through reinforcement learning alone tended to produce responses that were convoluted, long or otherwise difficult for humans to read. The researchers found that they needed to introduce some supervised learning to strike a balance between effective reasoning and intelligible responses to a broad variety of user queries. DeepSeek-R1 has developed from a powerful but opaque solution-finder into a system that is capable of human-like conversations. This journey reflects the need for AI systems that not only accurately solve problems but are also tools that humans can understand, trust and meaningfully collaborate with.
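As a rough illustration of the multistage framework described above, the sketch below alternates supervised and reinforcement-learning stages. Every function here is a stub invented for illustration; only the alternation pattern reflects the article, and the paper's actual pipeline details are not reproduced.

    # Schematic sketch of multistage training: alternating supervised
    # fine-tuning (SFT) with reinforcement learning (RL). The stage
    # bodies are stubs; the alternation structure is the point.

    def supervised_finetune(model: dict, sft_examples: list) -> dict:
        """SFT stage (stub): nudge the model towards readable,
        human-like responses using curated examples."""
        model["history"].append(("sft", len(sft_examples)))
        return model

    def reinforcement_learn(model: dict, prompts: list, reward_fn) -> dict:
        """RL stage (stub): reinforce answers that score well under a
        verifiable reward such as the maths/code rewards sketched above."""
        model["history"].append(("rl", len(prompts)))
        return model

    def train_multistage(model, sft_examples, prompts, reward_fn, rounds=2):
        """Alternate SFT (keeps outputs intelligible) with RL
        (sharpens reasoning on verifiable problems)."""
        for _ in range(rounds):
            model = supervised_finetune(model, sft_examples)
            model = reinforcement_learn(model, prompts, reward_fn)
        return model

    trained = train_multistage({"history": []}, ["curated example"],
                               ["maths prompt"], reward_fn=lambda a, t: 0.0)
    print(trained["history"])  # [('sft', 1), ('rl', 1), ('sft', 1), ('rl', 1)]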
[2]
DeepSeek bolsters AI 'reasoning' using trial-and-error
Chinese AI company DeepSeek has shown that the reasoning of its LLM DeepSeek-R1 can be improved through trial-and-error reinforcement learning, and that the model can even be made to explain its reasoning on math and coding problems, even though those explanations are sometimes unintelligible.

The release of DeepSeek-R1 in January 2025 inspired a $589 billion wipeout of Nvidia's market value, as investors feared it represented an easier and cheaper route to natural-language question-answering systems such as ChatGPT, from Silicon Valley darling OpenAI.

In a paper published in the science journal Nature, the DeepSeek AI team say they have established that their LLMs can be incentivized to learn to reason without getting examples from humans. In this way, reinforcement learning, akin to learning through trial and error, can slash the human input required to boost the model's performance. They argue that the approach improves performance on math and coding problems beyond that of LLMs trained on a corpus of human text and examples.

In an accompanying article, Carnegie Mellon University assistant professor Daphne Ippolito and her PhD student Yiming Zhang explain that reinforcement learning is similar to how a child might learn to play a video game. "As the child navigates their avatar through the game world, they learn through trial and error that some actions (such as collecting gold coins) earn points, whereas others (such as running into enemies) set their score back to zero," their article said. "This contrasts with previous prompting-based approaches, which were more akin to expecting a child to learn to master a video game by having them read the instructions, or supervised-learning approaches, which can be likened to expecting the child to master a game by watching a sibling play it hundreds of times," they said.

In addition to improving the reasoning behavior of the model, DeepSeek also showed that the trial-and-error process helped the model explain its working, so to speak. But some of the reasoning was difficult for mere humans to follow. For a start, it would sometimes inexplicably switch back and forth between English and Chinese. It might also produce extremely long reasoning traces containing more than 10,000 words. Other limitations come from the fact that the model was trained only on questions with clear-cut right or wrong answers, and it has yet to show an aptitude for more nuanced, subjective or long-form responses.

Yet by combining reinforcement learning and supervised learning, "DeepSeek-R1 achieved state-of-the-art accuracy on tasks that assessed maths and coding skills, factual knowledge and other forms of language understanding, in both Chinese and English," Ippolito and Zhang claimed. ®
Chinese AI company DeepSeek has developed a novel approach to improve AI reasoning using reinforcement learning. Their model, DeepSeek-R1, demonstrates enhanced performance in math and coding tasks without relying on human examples.
In a groundbreaking development, Chinese AI company DeepSeek has introduced a novel method to enhance AI reasoning capabilities using reinforcement learning. The research, published in Nature, demonstrates how their large language model (LLM) DeepSeek-R1 can learn to reason and explain its thought process without relying on human examples [1].
DeepSeek's approach leverages reinforcement learning, a technique akin to how children learn through trial and error. This method contrasts with traditional prompting-based or supervised-learning approaches, which rely heavily on human input or examples [2].
The model is rewarded for correct answers and penalized for incorrect ones, particularly in mathematics and programming tasks where answers are easily verifiable. This process naturally encourages the AI to develop its own reasoning strategies and output its thought process [1].
During training, DeepSeek-R1 exhibited interesting behaviors: it learned to reflect on and evaluate its own in-progress reasoning, to explore alternative approaches, and even to flag "aha moments" in its reasoning traces [1].
However, the approach has limitations. The model occasionally produces extremely long reasoning traces and struggles with nuanced or subjective questions [2].
Despite these challenges, DeepSeek-R1 has achieved state-of-the-art accuracy in tasks assessing mathematics, coding skills, factual knowledge, and language understanding in both Chinese and English [2].
The release of DeepSeek-R1 in January 2025 had a significant impact on the AI market, causing a $589 billion decrease in Nvidia's market value. Investors viewed it as a potential cheaper alternative to systems like OpenAI's ChatGPT [2].
.This research opens new avenues for AI development, potentially reducing the need for extensive human input in training advanced language models. As AI continues to evolve, DeepSeek's approach could lead to more efficient and capable AI systems, particularly in fields requiring complex reasoning and problem-solving skills.
Summarized by Navi