Curated by THEOUTPOST
On Mon, 21 Apr, 4:02 PM UTC
3 Sources
[1]
OpenAI's newest o3 and o4-mini models excel at coding and math - but hallucinate more often
A hot potato: OpenAI's latest artificial intelligence models, o3 and o4-mini, have set new benchmarks in coding, math, and multimodal reasoning. Yet, despite these advancements, the models are drawing concern for an unexpected and troubling trait: they hallucinate, or fabricate information, at higher rates than their predecessors - a reversal of the trend that has defined AI progress in recent years.

Historically, each new generation of OpenAI's models has delivered incremental improvements in factual accuracy, with hallucination rates dropping as the technology matured. However, internal testing and third-party evaluations now reveal that o3 and o4-mini, both classified as "reasoning models," are more prone to making things up than earlier reasoning models such as o1, o1-mini, and o3-mini, as well as the general-purpose GPT-4o, according to a report by TechCrunch.

On OpenAI's PersonQA benchmark, which measures a model's ability to answer questions about people accurately, o3 hallucinated in 33 percent of cases, more than double the rate of o1 and o3-mini, which scored 16 percent and 14.8 percent, respectively. O4-mini performed even worse, with a staggering 48 percent hallucination rate - nearly one in every two responses.

The reasons for this regression remain unclear, even to OpenAI's own researchers. In technical documentation, the company admits that "more research is needed" to understand why scaling up reasoning models appears to worsen the hallucination problem. One hypothesis, offered by Neil Chowdhury, a researcher at the nonprofit AI lab Transluce and a former OpenAI employee, is that the reinforcement learning techniques used for the o-series models may amplify issues that previous post-training processes had managed to mitigate, if not eliminate.

Third-party findings support this theory: Transluce documented instances where o3 invented actions it could not possibly have performed, such as claiming to run code on a 2021 MacBook Pro "outside of ChatGPT" and then copying the results into its answer - an outright fabrication. Sarah Schwettmann, co-founder of Transluce, warns that the higher hallucination rate could limit o3's usefulness in real-world applications. Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, told TechCrunch that while o3 excels in coding workflows, it often generates broken website links.

These hallucinations pose a substantial risk for businesses and industries such as law or finance, where accuracy is paramount. A model that fabricates facts could introduce errors into legal contracts or financial reports, undermining trust and utility. OpenAI acknowledges the challenge, with spokesperson Niko Felix telling TechCrunch that addressing hallucinations "across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability."

One promising avenue for reducing hallucinations is integrating web search capabilities. OpenAI's GPT-4o, when equipped with search, achieves 90 percent accuracy on the SimpleQA benchmark, suggesting that real-time retrieval could help ground AI responses in verifiable facts - at least where users are comfortable sharing their queries with third-party search providers. Meanwhile, the broader AI industry is shifting its focus toward reasoning models, which promise improved performance on complex tasks without requiring exponentially more data and computing power.
Yet, as the experience with o3 and o4-mini shows, this new direction brings its own set of challenges, chief among them the risk of increased hallucinations.
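As a rough illustration of what retrieval grounding can look like in practice, here is a minimal sketch; the `web_search` and `ask_model` functions are hypothetical placeholders rather than OpenAI's actual search integration, and the prompt wording is only one way to push a model to answer from retrieved snippets instead of memory.

```python
# Minimal sketch of retrieval-grounded question answering.
# `web_search` and `ask_model` are hypothetical placeholders, not a real API:
# the point is only that the model is asked to answer from cited snippets
# rather than from memory alone, and to say so when the snippets don't
# cover the question.

def web_search(query: str, k: int = 5) -> list[str]:
    """Return the top-k text snippets for the query (placeholder)."""
    raise NotImplementedError("wire up a real search provider here")

def ask_model(prompt: str) -> str:
    """Send the prompt to a language model and return its reply (placeholder)."""
    raise NotImplementedError("wire up a real model API here")

def grounded_answer(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using only the numbered snippets below. "
        "Cite snippet numbers, and reply 'not found in sources' if they "
        f"do not contain the answer.\n\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)
```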
[2]
OpenAI's Hot New AI Has an Embarrassing Problem
OpenAI launched its latest AI reasoning models, dubbed o3 and o4-mini, last week. According to the Sam Altman-led company, the new models outperform their predecessors and "excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis."

But there's one extremely important area where o3 and o4-mini appear to instead be taking a major step back: they tend to make things up -- or "hallucinate" -- substantially more than those earlier versions, as TechCrunch reports.

The news once again highlights a nagging technical issue that has plagued the industry for years now. Tech companies have struggled to rein in rampant hallucinations, which have greatly undercut the usefulness of tools like ChatGPT. Worryingly, OpenAI's two new models also buck a historical trend, which has seen each new model incrementally hallucinating less than the previous one, as TechCrunch points out, suggesting OpenAI is now headed in the wrong direction.

According to OpenAI's own internal testing, o3 and o4-mini tend to hallucinate more than older models, including o1, o1-mini, and even o3-mini, which was released in late January. Worse yet, the firm doesn't appear to fully understand why. According to its technical report, "more research is needed to understand the cause" of the rampant hallucinations.

Its o3 model scored a hallucination rate of 33 percent on the company's in-house accuracy benchmark, dubbed PersonQA. That's roughly double the rate of the company's preceding reasoning models. Its o4-mini scored an abysmal hallucination rate of 48 percent, part of which could be due to it being a smaller model that has "less world knowledge" and therefore tends to "hallucinate more," according to OpenAI.

Nonprofit AI research company Transluce also found in its own testing that o3 had a strong tendency to hallucinate, especially when generating computer code. The extent to which it tried to cover for its own shortcomings is baffling. "It further justifies hallucinated outputs when questioned by the user, even claiming that it uses an external MacBook Pro to perform computations and copies the outputs into ChatGPT," Transluce wrote in its blog post. Experts even told TechCrunch that OpenAI's o3 model hallucinates broken website links that simply don't work when the user tries to click them.

Unsurprisingly, OpenAI is well aware of these shortcomings. "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability," OpenAI spokesperson Niko Felix told TechCrunch.
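One way to catch the broken-link failure mode described above is to verify every URL a model emits before passing its output along. The sketch below is a generic illustration using only the Python standard library; it is not something OpenAI or TechCrunch describes.

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")

def check_links(model_output: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each URL found in the text to whether it resolved.

    A False value means the request failed or returned an HTTP error,
    which is how a hallucinated link would typically show up.
    """
    results = {}
    for url in URL_PATTERN.findall(model_output):
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[url] = resp.status < 400
        except (OSError, ValueError):
            # Covers DNS failures, timeouts, HTTP errors, and malformed URLs.
            results[url] = False
    return results
```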
[3]
New OpenAI Models Hallucinating More Than Their Predecessors
OpenAI's new artificial intelligence (AI) reasoning models, o3 and o4-mini, are hallucinating more than their predecessors, the company's internal testing report reveals. OpenAI launched the two new reasoning models, designed to pause and work through questions before responding, earlier this month, according to TechCrunch.

An AI model can sometimes produce inaccurate and misleading results. Such inaccurate results are referred to as "AI hallucinations". A variety of factors, like insufficient training data, incorrect assumptions made by the model, or biases in the data used to train the model, can cause these hallucinations or errors. The problem is further amplified as people use these AI models for making important decisions like medical diagnoses or financial trading.

"We tested OpenAI o3 and o4-mini against PersonQA, an evaluation that aims to elicit hallucinations. PersonQA is a dataset of questions and publicly available facts that measures the model's accuracy on attempted answers. We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated)," the company highlighted in its report.

The ChatGPT developer noted that its o4-mini model "underperforms" when compared to the o1 and o3 models. The company expected this, as smaller models have "less world knowledge" and tend to hallucinate more, the report noted.

The OpenAI o4-mini model scored 0.36 on accuracy in the PersonQA evaluation, compared to the o3 model's 0.59 and the o1 model's 0.47. Not only was the OpenAI o4-mini model the least accurate, it was also hallucinating the most: it scored 0.48 on PersonQA's hallucination rate metric, higher than the o3 model's 0.33 score and much higher than the o1 model's 0.16 score.

Interestingly, the o3 model had a better accuracy rate than the o1 model, but its hallucination rate was more than twice the o1 model's, as the scores above show. "Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims," the company highlighted in the report.

Another worrying sign from the report is that, as of now, OpenAI is unable to decipher why its new reasoning models are hallucinating more. The ChatGPT developer noted, "More research is needed to understand the cause of this result."

OpenAI trains its o-series models with "large-scale" reinforcement learning from human feedback (RLHF) on chains of thought. "These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment," the company remarks in the report. RLHF is a machine learning technique in which a developer trains a "reward model" with the help of direct feedback from human evaluators; the reward model is then used to optimise the performance of an AI agent.

There are two hypotheses that explain why large language models (LLMs) hallucinate. The first is that LLMs "lack the understanding of the cause and effect of their actions", which could be addressed by treating response generation as causal interventions. The second, as per the blog, is that a mismatch between the LLM's internal knowledge and the labeller's internal knowledge causes LLMs to hallucinate. "During [Supervised Fine-Tuning] SFT, LLMs are trained to mimic responses written by humans. If we give a response using the knowledge that we have but the LLM doesn't have, we're teaching the LLM to hallucinate," the blog added.

Most recently, in March 2025, a privacy rights group called Noyb and a Norway resident, Arve Hjalmar Holmen, filed a complaint against ChatGPT in Europe for producing erroneous defamatory information. Holmen alleged that OpenAI's chatbot was providing a made-up claim about him murdering two of his children and attempting to kill the third. Back home, in February 2025, the Bengaluru bench of the Income Tax Appellate Tribunal (ITAT) "hastily" withdrew an order as it cited made-up court judgements. There have been other such cases of AI models hallucinating and providing made-up answers to users. How OpenAI and others choose to respond to these issues remains to be seen.
OpenAI's latest AI models, o3 and o4-mini, show improved performance in coding and math but struggle with increased hallucination rates, raising concerns about their reliability and real-world applications.
OpenAI has recently launched its latest artificial intelligence models, o3 and o4-mini, which have demonstrated exceptional capabilities in coding, math, and complex reasoning tasks. However, these advancements come with an unexpected drawback: increased rates of hallucination, or the tendency to generate false or misleading information 1.
The new models, classified as "reasoning models," have set new benchmarks in solving complex math, coding, and scientific challenges while also showing strong visual perception and analysis skills 2. However, internal testing and third-party evaluations have revealed a troubling trend: these models are more prone to hallucinations than their predecessors.
On OpenAI's PersonQA benchmark, which measures a model's ability to answer questions about people accurately, o3 hallucinated in 33% of cases, more than double the rate of earlier models like o1 and o3-mini. The o4-mini model performed even worse, with a staggering 48% hallucination rate 1.
This increase in hallucination rates represents a reversal of the trend that has defined AI progress in recent years. Historically, each new generation of OpenAI's models has delivered incremental improvements in factual accuracy. The reasons for this regression remain unclear, even to OpenAI's own researchers 1.
One hypothesis suggests that the reinforcement learning techniques used for the o-series models may amplify issues that previous post-training processes had managed to mitigate. OpenAI acknowledges that "more research is needed" to understand why scaling up reasoning models appears to worsen the hallucination problem 3.
The higher hallucination rates could significantly limit the usefulness of these models in real-world applications, particularly in industries where accuracy is paramount, such as law or finance. Experts have noted instances where o3 invented actions it could not possibly have performed and generated broken website links 1 2.
OpenAI is aware of these challenges and is actively working to improve the accuracy and reliability of its models. The company is exploring various avenues to address the hallucination issue, including the integration of web search capabilities to ground AI responses in verifiable facts 1.
As the AI industry shifts its focus toward reasoning models, which promise improved performance on complex tasks without requiring exponentially more data and computing power, the experience with o3 and o4-mini highlights new challenges. The increased risk of hallucinations in these advanced models underscores the need for continued research and development to balance improved reasoning capabilities with factual accuracy 1 3.