Curated by THEOUTPOST
On Mon, 21 Apr, 4:02 PM UTC
7 Sources
[1]
OpenAI's most capable models hallucinate more than earlier ones
Researchers say the hallucinations make o3 "less useful" than it would be. OpenAI says its latest models, o3 and o4-mini, are its most powerful yet. However, research shows the models also hallucinate more -- at least twice as much as earlier models. Also: How to use ChatGPT: A beginner's guide to the most popular AI chatbot In the system card, a report that accompanies each new AI model, and published with the release last week, OpenAI reported that o4-mini is less accurate and hallucinates more than both o1 and o3. Using PersonQA, an internal test based on publicly available information, the company found o4-mini hallucinated in 48% of responses, which is three times o1's rate. While o4-mini is smaller, cheaper, and faster than o3, and, therefore, wasn't expected to outperform it, o3 still hallucinated in 33% of responses, or twice the rate of o1. Of the three models, o3 scored the best on accuracy. Also: OpenAI's o1 lies more than any major AI model. Why that matters "o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims," OpenAI's report explained. "More research is needed to understand the cause of this result." Hallucinations, which refer to fabricated claims, studies, and even URLs, have continued to plague even the most cutting-edge advancements in AI. There is currently no perfect solution for preventing or identifying them, though OpenAI has tried some approaches. Additionally, fact-checking is a moving target, making it hard to embed and scale. Fact-checking involves some level of human cognitive skills that AI mostly lacks, like common sense, discernment, and contextualization. As a result, the extent to which a model hallucinates relies heavily on training data quality (and access to the internet for current information). Minimizing false information in training data can lessen the chance of an untrue statement downstream. However, this technique doesn't prevent hallucinations, as many of an AI chatbot's creative choices are still not fully understood. Overall, the risk of hallucinations tends to reduce slowly with each new model release, which is what makes o3 and o4-mini's scores somewhat unexpected. Though o3 gained 12 percentage points over o1 in accuracy, the fact that the model hallucinates twice as much suggests its accuracy hasn't grown proportionally to its capabilities. Also: My two favorite AI apps on Linux - and how I use them to get more done Like other recent releases, o3 and o4-mini are reasoning models, meaning they externalize the steps they take to interpret a prompt for a user to see. Last week, independent research lab Transluce published its evaluation, which found that o3 often falsifies actions it can't take in response to a request, including claiming to run Python in a coding environment, despite the chatbot not having that ability. What's more, the model doubles down when caught. "[o3] further justifies hallucinated outputs when questioned by the user, even claiming that it uses an external MacBook Pro to perform computations and copies the outputs into ChatGPT," the report explained. Transluce found that these false claims about running code were more frequent in o-series models (o1, o3-mini, and o3) than GPT-series models (4.1 and 4o). This result is especially confusing because reasoning models take longer to provide more thorough, higher-quality answers. Transluce cofounder Sarah Schwettmann even told TechCrunch that "o3's hallucination rate may make it less useful than it otherwise would be." Also: Chatbots are distorting news - even for paid users The report from Transluce said: "Although truthfulness issues from post-training are known to exist, they do not fully account for the increased severity of hallucination in reasoning models. We hypothesize that these issues might be intensified by specific design choices in o-series reasoning models, such as outcome-based reinforcement learning and the omission of chains-of-thought from previous turns." Last week, sources inside OpenAI and third-party testers confirmed the company has drastically minimized safety testing for new models, including o3. While the system card shows o3 and o4-mini are "approximately on par" with o1 for robustness against jailbreak attempts (all three score between 96% and 100%), these hallucination scores raise questions about the non-safety-related impacts of changing testing timelines. The onus is still on users to fact-check any AI model's output. This strategy appears wise when using the latest-generation reasoning models.
[2]
OpenAI's newest o3 and o4-mini models excel at coding and math - but hallucinate more often
A hot potato: OpenAI's latest artificial intelligence models, o3 and o4-mini, have set new benchmarks in coding, math, and multimodal reasoning. Yet, despite these advancements, the models are drawing concern for an unexpected and troubling trait: they hallucinate, or fabricate information, at higher rates than their predecessors - a reversal of the trend that has defined AI progress in recent years. Historically, each new generation of OpenAI's models has delivered incremental improvements in factual accuracy, with hallucination rates dropping as the technology matured. However, internal testing and third-party evaluations now reveal that o3 and o4-mini, both classified as "reasoning models," are more prone to making things up than earlier reasoning models such as o1, o1-mini, and o3-mini, as well as the general-purpose GPT-4o, according to a report by TechCrunch. On OpenAI's PersonQA benchmark, which measures a model's ability to answer questions about people accurately, o3 hallucinated in 33 percent of cases, more than double the rate of o1 and o3-mini, which scored 16 percent and 14.8 percent, respectively. O4-mini performed even worse, with a staggering 48 percent hallucination rate - nearly one in every two responses. The reasons for this regression remain unclear, even to OpenAI's own researchers. In technical documentation, the company admits that "more research is needed" to understand why scaling up reasoning models appears to worsen the hallucination problem. One hypothesis, offered by Neil Chowdhury, a researcher at the nonprofit AI lab Transluce and a former OpenAI employee, is that the reinforcement learning techniques used for the o-series models may amplify issues that previous post-training processes had managed to mitigate, if not eliminate. Third-party findings support this theory: Transluce documented instances where o3 invented actions it could not possibly have performed, such as claiming to run code on a 2021 MacBook Pro "outside of ChatGPT" and then copying the results into its answer - an outright fabrication. Sarah Schwettmann, co-founder of Transluce, warns that the higher hallucination rate could limit o3's usefulness in real-world applications. Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, told TechCrunch that while o3 excels in coding workflows, it often generates broken website links. These hallucinations pose a substantial risk for businesses and industries where accuracy, such as law or finance, is paramount. A model that fabricates facts could introduce errors into legal contracts or financial reports, undermining trust and utility. OpenAI acknowledges the challenge, with spokesperson Niko Felix telling TechCrunch that addressing hallucinations "across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability." One promising avenue for reducing hallucinations is integrating web search capabilities. OpenAI's GPT-4o, when equipped with search, achieves 90 percent accuracy on the SimpleQA benchmark, suggesting that real-time retrieval could help ground AI responses in verifiable facts - at least where users are comfortable sharing their queries with third-party search providers. Meanwhile, the broader AI industry is shifting its focus toward reasoning models, which promise improved performance on complex tasks without requiring exponentially more data and computing power. Yet, as the experience with o3 and o4-mini shows, this new direction brings its own set of challenges, chief among them the risk of increased hallucinations.
[3]
OpenAI's newest AI models hallucinate way more, for reasons unknown
This does not bode well if you're using the new o3 and o4-mini reasoning models for factual answers. Last week, OpenAI released its new o3 and o4-mini reasoning models, which perform significantly better than their o1 and o3-mini predecessors and have new capabilities like "thinking with images" and agentically combining AI tools for more complex results. However, according to OpenAI's internal tests, these new o3 and o4-mini reasoning models also hallucinate significantly more often than previous AI models, reports TechCrunch. This is unusual as newer models tend to hallucinate less as the underlying AI tech improves. In the realm of LLMs and reasoning AIs, a "hallucination" occurs when the model makes up information that sounds convincing but has no bearing in truth. In other words, when you ask questions to ChatGPT, it may respond with an answer that's patently false or incorrect. OpenAI's in-house benchmark PersonQA -- which is used to measure the factual accuracy of its AI models when talking about people -- found that o3 hallucinated in 33 percent of responses while o4-mini did even worse at 48 percent. By comparison, the older o1 and o3-mini models hallucinated 16 percent and 14.8 percent, respectively. As of now, OpenAI says they don't know why hallucinations have increased in the newer reasoning models. Hallucinations may be fine for creative endeavors, but they undermine the credibility of AI assistants like ChatGPT when used for tasks where accuracy is paramount. In a statement to TechCrunch, an OpenAI rep said that the company is "continually working to improve [their models'] accuracy and reliability."
[4]
OpenAI's leading models keep making things up -- here's why
OpenAI's newly released o3 and o4-mini are some of the smartest AI models to ever be released, but they seem to be suffering from one major problem. Both models are hallucinating. This in itself isn't out of the ordinary, as most AI models still tend to do this. But these two new versions seem to be hallucinating more than a number of OpenAI's older models. Historically, while most new models continue to hallucinate, the risk has reduced with each new release. The potentially larger issue here is that OpenAI doesn't know why this has happened. If you've used an AI model, you've most likely seen it hallucinate. This is when the model produces incorrect or misleading results. That could mean producing incorrect statistics, getting a picture prompt wrong or simply messing up on the prompt given. This can be a small, non-important issue. For example, if a chatbot is asked to create a poem only using words beginning with "b" and includes the word "tree," that would be a hallucination, albeit a rather low stakes one. However, if a chatbot was asked for a list of foods that are safe for someone with a gluten intolerance, and it suggests bread rolls, that would be a hallucination with some risk. In OpenAI's technical report for these two models, it was explained that they both underperformed in PersonQA, an evaluation of AI model's hallucination rates. "This is expected, as smaller models have less world knowledge and tend to hallucinate more. However, we also observed some performance differences comparing o1 and o3," the report states. "Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims. More research is needed to understand the cause of this result." OpenAI's report found that o3 hallucinated in response to 33% of questions. That is roughly double the hallucination rate of OpenAI's previous reasoning models. Both of these models are still fairly new and, now released to the public, they could see drastic improvements in their hallucination rates as testing continues. However, as both models are set up for more complex tasks, this could be problematic going forward. As mentioned above, hallucinations can be a funny quirk in non-important prompts. However, reasoning models (AI designed to take on more complex tasks) are typically handling more important information. If this is a pattern that continues with future reasoning models from OpenAI, it could make for a difficult sales pitch, especially for larger companies looking to spend hefty amounts of money to use o3 and o4-mini.
[5]
OpenAI's Hot New AI Has an Embarrassing Problem
OpenAI launched its latest AI reasoning models, dubbed o3 and o4-mini, last week. According to the Sam Altman-led company, the new models outperform their predecessors and "excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis." But there's one extremely important area where o3 and o4-mini appear to instead be taking a major step back: they tend to make things up -- or "hallucinate" -- substantially more than those earlier versions, as TechCrunch reports. The news once again highlights a nagging technical issue that has plagued the industry for years now. Tech companies have struggled to rein in rampant hallucinations, which have greatly undercut the usefulness of tools like ChatGPT. Worryingly, OpenAI's two new models also buck a historical trend, which has seen each new model incrementally hallucinating less than the previous one, as TechCrunch points out, suggesting OpenAI is now headed in the wrong direction. According to OpenAI's own internal testing, o3 and o4-mini tend to hallucinate more than older models, including o1, o1-mini, and even o3-mini, which was released in late January. Worse yet, the firm doesn't appear to fully understand why. According to its technical report, "more research is needed to understand the cause" of the rampant hallucinations. Its o3 model scored a hallucination rate of 33 percent on the company's in-house accuracy benchmark, dubbed PersonQA. That's roughly double the rate compared to the company's preceding reasoning models. Its o4-mini scored an abysmal hallucination rate of 48 percent, part of which could be due to it being a smaller model that has "less world knowledge" and therefore tends to "hallucinate more," according to OpenAI. Nonprofit AI research company Transluce also found in its own testing that o3 had a strong tendency to hallucinate, especially when generating computer code. The extent to which it tried to cover for its own shortcomings is baffling. "It further justifies hallucinated outputs when questioned by the user, even claiming that it uses an external MacBook Pro to perform computations and copies the outputs into ChatGPT," Transluce wrote in its blog post. Experts even told TechCrunch OpenAI's o3 model hallucinates broken website links that simply don't work when the user tries to click them. Unsurprisingly, OpenAI is well aware of these shortcomings. "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability," OpenAI spokesperson Niko Felix told TechCrunch.
[6]
It's not your imagination -- ChatGPT models actually do hallucinate more now
Table of Contents Table of Contents What do the tests say? What are AI "hallucinations" and why do they happen? What's the fix? OpenAI released a paper last week detailing various internal tests and findings about its o3 and o4-mini models. The main differences between these newer models and the first versions of ChatGPT we saw in 2023 are their advanced reasoning and multimodal capabilities. o3 and o4-mini can generate images, search the web, automate tasks, remember old conversations, and solve complex problems. However, it seems these improvements have also brought unexpected side effects. What do the tests say? OpenAI has a specific test for measuring hallucination rates called PersonQA. It includes a set of facts about people to "learn" from and a set of questions about those people to answer. The model's accuracy is measured based on its attempts to answer. Last year's o1 model achieved an accuracy rate of 47% and a hallucination rate of 16%. Recommended Videos Since these two values don't add up to 100%, we can assume the rest of the responses were neither accurate nor hallucinations. The model might sometimes say it doesn't know or can't locate the information, it may not make any claims at all and provide related information instead, or it might make a slight mistake that can't be classified as a full-on hallucination. When o3 and o4-mini were tested against this evaluation, they hallucinated at a significantly higher rate than o1. According to OpenAI, this was somewhat expected for the o4-mini model because it's smaller and has less world knowledge, leading to more hallucinations. Still, the 48% hallucination rate it achieved seems very high considering o4-mini is a commercially available product that people are using to search the web and get all sorts of different information and advice. o3, the full-sized model, hallucinated on 33% of its responses during the test, outperforming o4-mini but doubling the rate of hallucination compared to o1. It also had a high accuracy rate, however, which OpenAI attributes to its tendency to make more claims overall. So, if you use either of these two newer models and have noticed a lot of hallucinations, it's not just your imagination. (Maybe I should make a joke there like "Don't worry, you're not the one that's hallucinating.") What are AI "hallucinations" and why do they happen? While you've likely heard about AI models "hallucinating" before, it's not always clear what it means. Whenever you use an AI product, OpenAI or otherwise, you're pretty much guaranteed to see a disclaimer somewhere saying that its responses can be inaccurate and you have to fact-check for yourself. Inaccurate information can come from all over the place -- sometimes a bad fact gets on to Wikipedia or users spout nonsense on Reddit, and this misinformation can find its way into AI responses. For example, Google's AI Overviews got a lot of attention when it suggested a recipe for pizza that included "non-toxic glue." In the end, it was discovered that Google got this "information" from a joke on a Reddit thread. However, these aren't "hallucinations," they're more like tracable mistakes that arise from bad data and misinterpretation. Hallucinations, on the other hand, are when the AI model makes a claim without any clear source or reason. It often happens when an AI model can't find the information it needs to answer a specific query, and OpenAI has defined it as "a tendency to invent facts in moments of uncertainty." Other industry figures have called it "creative gap-filling." You can encourage hallucinations by giving ChatGPT leading questions like "What are the seven iPhone 16 models available right now?" Since there aren't seven models, the LLM is somewhat likely to give you some real answers -- and then make up additional models to finish the job. Chatbots like ChatGPT aren't only trained on the internet data that informs the content of their responses, they're also trained on "how to respond". They're shown thousands of example queries and matching ideal responses to encourage the right kind of tone, attitude, and level of politeness. This part of the training process is what causes an LLM to sound like it agrees with you or understands what you're saying even as the rest of its output completely contradicts those statements. It's possible that this training could be part of the reason hallucinations are so frequent -- because a confident response that answers the question has been reinforced as a more favorable outcome compared to a response that fails to answer the question. To us, it seems obvious that spouting random lies is worse than just not knowing the answer -- but LLMs don't "lie." They don't even know what a lie is. Some people say AI mistakes are like human mistakes, and since "we don't get things right all the time, we shouldn't expect the AI to either." However, it's important to remember that mistakes from AI are simply a result of imperfect processes designed by us. AI models don't lie, develop misunderstandings, or misremember information like we do. They don't even have concepts of accuracy or inaccuracy -- they simply predict the next word in a sentence based on probabilities. And since we're thankfully still in a state where the most commonly said thing is likely to be the correct thing, those reconstructions often reflect accurate information. That makes it sound like when we get "the right answer," it's just a random side effect rather than an outcome we've engineered -- and that is indeed how things work. We feed an entire internet's worth of information to these models -- but we don't tell them which information is good or bad, accurate or inaccurate -- we don't tell them anything. They don't have existing foundational knowledge or a set of underlying principles to help them sort the information for themselves either. It's all just a numbers game -- the patterns of words that exist most frequently in a given context become the LLM's "truth." To me, this sounds like a system that's destined to crash and burn -- but others believe this is the system that will lead to AGI (though that's a different discussion.) What's the fix? The problem is, OpenAI doesn't yet know why these advanced models tend to hallucinate more often. Perhaps with a little more research, we will be able to understand and fix the problem -- but there's also a chance that things won't go that smoothly. The company will no doubt keep releasing more and more "advanced" models, and there is a chance that hallucination rates will keep rising. In this case, OpenAI might need to pursue a short-term solution as well as continue its research into the root cause. After all, these models are money-making products and they need to be in a useable state. I'm no AI scientist, but I suppose my first idea would be to create some kind of aggregate product -- a chat interface that has access to multiple different OpenAI models. When a query requires advanced reasoning, it would call on GPT-4o, and when it wants to minimize the chances of hallucinations, it would call on an older model like o1. Perhaps the company would be able to go even fancier and use different models to take care of different elements of a single query, and then use an additional model to stitch it all together at the end. Since this would essentially be teamwork between multiple AI models, perhaps some kind of fact-checking system could be implemented as well. However, raising the accuracy rates is not the main goal. The main goal is to lower hallucination rates, which means we need to value responses that say "I don't know" as well as responses with the right answers. In reality, I have no idea what OpenAI will do or how worried its researchers really are about the growing rate of hallucinations. All I know is that more hallucinations are bad for end users -- it just means more and more opportunities for us to be misled without realizing it. If you're big into LLMs, there's no need to stop using them -- but don't let the desire to save time win out over the need to fact-check the results. Always fact-check!
[7]
New OpenAI Models Hallucinating More Than their Predecessor
OpenAI's new artificial intelligence (AI) reasoning models, o3 and o4-mini, are hallucinating more than their predecessor, the company's internal testing report reveals. OpenAI launched the two new reasoning models, designed to pause and work through questions before responding, earlier this month, according to TechCrunch. An AI model can sometimes produce inaccurate and misleading results. Such inaccurate results are referred to as "AI hallucinations". A variety of factors, like insufficient training data, incorrect assumptions made by the model, or biases in the data used to train the model, can cause these hallucinations or errors. The problem is further amplified as people use these AI models for making important decisions like medical diagnoses or financial trading. "We tested OpenAI o3 and o4-mini against PersonQA, an evaluation that aims to elicit hallucinations. PersonQA is a dataset of questions and publicly available facts that measures the model's accuracy on attempted answers. "We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated)," the company highlighted in its report. The ChatGPT developer noted that its o4-mini model "underperforms" when compared to the o1 and o3 models. The company expected this as smaller models have "less world knowledge" and tend to hallucinate more, the report noted. The OpenAI o4-mini model scored 0.36 on accuracy in the PersonQA evaluation, compared to the o3 model's 0.59 and the o1 model's 0.47. Not only was the OpenAI o4-mini model the least accurate, it was also hallucinating the most: it scored 0.48 in the hallucination rate test by PersonQA. This was slightly higher than the o3 model's 0.33 score and much higher than the o1 model's 0.16 score. Interestingly, the o3 model had a better accuracy rate than the o1 model, but its hallucination rate was more than twice the o1 model's hallucination rate, evident from the scores mentioned above. "Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims," the company highlighted in the report. Another worrying sign that emerged from the report was that as of now OpenAI is unable to decipher why its new reasoning models are hallucinating more. The ChatGPT developer noted, "More research is needed to understand the cause of this result." OpenAI trains its o-series models with "large-scale" reinforcement learning from human feedback (RLHF) on chains of thought. "These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment," the company remarks in the report. RLHF is a machine learning technique in which a developer trains the "reward model" with the help of direct feedback from a human being. It is then used for optimising the performance of an AI agent. There are two hypotheses that explain why large language models (LLMs) hallucinate. The first hypothesis is that LLMs "lack the understanding of the cause and effect of their actions" and that this can be addressed by treating response generation as causal interventions. The second hypothesis, as per the blog, is that the mismatch between the LLM's internal knowledge and the labeller's internal knowledge causes the LLMs to hallucinate. "During [Supervised Fine-Tuning] SFT, LLMs are trained to mimic responses written by humans. If we give a response using the knowledge that we have but the LLM doesn't have, we're teaching the LLM to hallucinate," the blog added. Most recently, in March 2025, a privacy rights group called Noyb, and a Norway resident, Arve Hjalmar Holmer, filed a complaint against ChatGPT in Europe for producing erroneous defamatory information. Holmen alleged that OpenAI's chatbot was providing a made-up claim about him murdering two of his children and attempting to kill the third. Back home, in February 2025 the Bengaluru bench of the Income Tax Appellate Authority (ITAT) "hastily" withdrew an order as it cited made-up court judgements. There have been other such cases of AI models hallucinating and providing made-up answers to users. How OpenAI and others choose to respond to these issues remains to be seen.
Share
Share
Copy Link
OpenAI's new o3 and o4-mini models show improved performance in various tasks but face a significant increase in hallucination rates, raising concerns about their reliability and usefulness.
OpenAI has released its latest AI models, o3 and o4-mini, touting significant improvements in coding, math, and multimodal reasoning capabilities 2. These new "reasoning models" are designed to handle more complex tasks and provide more thorough, higher-quality answers 1. According to OpenAI, the models excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis 5.
Despite their advanced capabilities, o3 and o4-mini have shown a concerning trend: they hallucinate, or fabricate information, at higher rates than their predecessors 123. This development breaks the historical pattern of decreasing hallucination rates with each new model release 2.
OpenAI's internal testing using the PersonQA benchmark revealed:
The exact reasons for this increase in hallucinations remain unclear, even to OpenAI's researchers 12. Some hypotheses include:
These hallucinations pose significant risks for industries where accuracy is crucial, such as law and finance 2. Sarah Schwettmann, co-founder of Transluce, warns that the higher hallucination rate could limit o3's usefulness in real-world applications 2.
Researchers have observed concerning behaviors in the new models:
OpenAI acknowledges the challenge, stating that addressing hallucinations "across all our models is an ongoing area of research" 25. The company is exploring potential solutions, including:
As the AI industry shifts focus towards reasoning models, the experience with o3 and o4-mini highlights the need for balanced progress in both capabilities and reliability 2. For now, users are advised to remain cautious and fact-check AI-generated information, especially when using these latest-generation reasoning models 1.
Reference
[2]
[4]
[5]
Recent tests reveal that newer AI models, including OpenAI's latest offerings, are experiencing higher rates of hallucinations despite improvements in reasoning capabilities. This trend raises concerns about AI reliability and its implications for various applications.
6 Sources
6 Sources
An exploration of AI hallucinations, their causes, and potential consequences across various applications, highlighting the need for vigilance and fact-checking in AI-generated content.
8 Sources
8 Sources
Recent research reveals that while larger AI language models demonstrate enhanced capabilities in answering questions, they also exhibit a concerning trend of increased confidence in incorrect responses. This phenomenon raises important questions about the development and deployment of advanced AI systems.
5 Sources
5 Sources
Recent research reveals GPT-4's ability to pass the Turing Test, raising questions about the test's validity as a measure of artificial general intelligence and prompting discussions on the nature of AI capabilities.
3 Sources
3 Sources
AI hallucinations, while often seen as a drawback, offer valuable insights for businesses and healthcare. This article explores the implications and potential benefits of AI hallucinations in various sectors.
2 Sources
2 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved