6 Sources
[1]
AI hallucinations are getting worse - and they're here to stay
AI chatbots from tech companies such as OpenAI and Google have been getting so-called reasoning upgrades over the past months - ideally to make them better at giving us answers we can trust, but recent testing suggests they are sometimes doing worse than previous models. The errors made by chatbots, known as "hallucinations", have been a problem from the start, and it is becoming clear we may never get rid of them. Hallucination is a blanket term for certain kinds of mistakes made by the large language models (LLMs) that power systems like OpenAI's ChatGPT or Google's Gemini. It is best known as a description of the way they sometimes present false information as true. But it can also refer to an AI-generated answer that is factually accurate, but not actually relevant to the question it was asked, or fails to follow instructions in some other way. An OpenAI technical report evaluating its latest LLMs showed that its o3 and o4-mini models, which were released in April, had significantly higher hallucination rates than the company's previous o1 model that came out in late 2024. For example, when summarising publicly available facts about people, o3 hallucinated 33 per cent of the time while o4-mini did so 48 per cent of the time. In comparison, o1 had a hallucination rate of 16 per cent. The problem isn't limited to OpenAI. One popular leaderboard from the company Vectara that assesses hallucination rates indicates some "reasoning" models - including the DeepSeek-R1 model from developer DeepSeek - saw double-digit rises in hallucination rates compared with previous models from their developers. This type of model goes through multiple steps to demonstrate a line of reasoning before responding. OpenAI says the reasoning process isn't to blame. "Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini," says an OpenAI spokesperson. "We'll continue our research on hallucinations across all models to improve accuracy and reliability." Some potential applications for LLMs could be derailed by hallucination. A model that consistently states falsehoods and requires fact-checking won't be a helpful research assistant; a paralegal-bot that cites imaginary cases will get lawyers into trouble; a customer service agent that claims outdated policies are still active will create headaches for the company. However, AI companies initially claimed that this problem would clear up over time. Indeed, after they were first launched, models tended to hallucinate less with each update. But the high hallucination rates of recent versions are complicating that narrative - whether or not reasoning is at fault. Vectara's leaderboard ranks models based on their factual consistency in summarising documents they are given. This showed that "hallucination rates are almost the same for reasoning versus non-reasoning models", at least for systems from OpenAI and Google, says Forrest Sheng Bao at Vectara. Google didn't provide additional comment. For the leaderboard's purposes, the specific hallucination rate numbers are less important than the overall ranking of each model, says Bao. But this ranking may not be the best way to compare AI models. For one thing, it conflates different types of hallucinations. 
The Vectara team pointed out that, although the DeepSeek-R1 model hallucinated 14.3 per cent of the time, most of these were "benign": answers that are factually supported by logical reasoning or world knowledge, but not actually present in the original text the bot was asked to summarise. DeepSeek didn't provide additional comment. Another problem with this kind of ranking is that testing based on text summarisation "says nothing about the rate of incorrect outputs when [LLMs] are used for other tasks", says Emily Bender at the University of Washington. She says the leaderboard results may not be the best way to judge this technology because LLMs aren't designed specifically to summarise texts. These models work by repeatedly answering the question of "what is a likely next word" to formulate answers to prompts, and so they aren't processing information in the usual sense of trying to understand what information is available in a body of text, says Bender. But many tech companies still frequently use the term "hallucinations" when describing output errors. "'Hallucination' as a term is doubly problematic," says Bender. "On the one hand, it suggests that incorrect outputs are an aberration, perhaps one that can be mitigated, whereas the rest of the time the systems are grounded, reliable and trustworthy. On the other hand, it functions to anthropomorphise the machines - hallucination refers to perceiving something that is not there [and] large language models do not perceive anything." Arvind Narayanan at Princeton University says that the issue goes beyond hallucination. Models also sometimes make other mistakes, such as drawing upon unreliable sources or using outdated information. And simply throwing more training data and computing power at AI hasn't necessarily helped. The upshot is, we may have to live with error-prone AI. Narayanan said in a social media post that it may be best in some cases to only use such models for tasks when fact-checking the AI answer would still be faster than doing the research yourself. But the best move may be to completely avoid relying on AI chatbots to provide factual information, says Bender.
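To make the mechanism Bender describes concrete, here is a minimal toy sketch of that "likely next word" loop, with a tiny made-up probability table standing in for the billions of learned parameters in a real model; note that nothing in the loop ever checks whether the sentence it assembles is true.

import random

# Toy next-word probabilities (entirely hypothetical values, not from any real model)
next_word_probs = {
    "water": {"boils": 0.6, "freezes": 0.4},
    "boils": {"at": 1.0},
    "at": {"100°C": 0.7, "80°F": 0.3},  # a fluent but false continuation can still be "likely"
}

def generate(prompt_word, steps=3):
    words = [prompt_word]
    for _ in range(steps):
        options = next_word_probs.get(words[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        # pick the next word by probability, not by truth
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("water"))  # sometimes "water boils at 80°F": fluent, confident, wrong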
[2]
A.I. Hallucinations Are Getting Worse, Even as New Systems Become More Powerful
Cade Metz reported from San Francisco, and Karen Weise from Seattle. Last month, an A.I. bot that handles tech support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than just one computer. In angry posts to internet message boards, the customers complained. Some canceled their Cursor accounts. And some got even angrier when they realized what had happened: The A.I. bot had announced a policy change that did not exist. "We have no such policy. You're of course free to use Cursor on multiple machines," the company's chief executive and co-founder, Michael Truell, wrote in a Reddit post. "Unfortunately, this is an incorrect response from a front-line A.I. support bot." More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using A.I. bots for an increasingly wide array of tasks. But there is still no way of ensuring that these systems produce accurate information. The newest and most powerful technologies -- so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek -- are generating more errors, not fewer. As their math skills have notably improved, their handle on facts has gotten shakier. It is not entirely clear why. Today's A.I. bots are based on complex mathematical systems that learn their skills by analyzing enormous amounts of digital data. They do not -- and cannot -- decide what is true and what is false. Sometimes, they just make stuff up, a phenomenon some A.I. researchers call hallucinations. On one test, the hallucination rates of newer A.I. systems were as high as 79 percent. These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of mistakes. "Despite our best efforts, they will always hallucinate," said Amr Awadallah, the chief executive of Vectara, a start-up that builds A.I. tools for businesses, and a former Google executive. "That will never go away." For several years, this phenomenon has raised concerns about the reliability of these systems. Though they are useful in some situations -- like writing term papers, summarizing office documents and generating computer code -- their mistakes can cause problems. The A.I. bots tied to search engines like Google and Bing sometimes generate search results that are laughably wrong. If you ask them for a good marathon on the West Coast, they might suggest a race in Philadelphia. If they tell you the number of households in Illinois, they might cite a source that does not include that information. Those hallucinations may not be a big problem for many people, but it is a serious issue for anyone using the technology with court documents, medical information or sensitive business data. "You spend a lot of time trying to figure out which responses are factual and which aren't," said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. "Not dealing with these errors properly basically eliminates the value of A.I. systems, which are supposed to automate tasks for you." Cursor and Mr. Truell did not respond to requests for comment. For more than two years, companies like OpenAI and Google steadily improved their A.I. systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. 
The latest OpenAI systems hallucinate at a higher rate than the company's previous system, according to the company's own tests. The company found that o3 -- its most powerful system -- hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI's previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent. When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time. In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because A.I. systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave in the ways they do. "Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini," a company spokeswoman, Gaby Raila, said. "We'll continue our research on hallucinations across all models to improve accuracy and reliability." Hannaneh Hajishirzi, a professor at the University of Washington and a researcher with the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system's behavior back to the individual pieces of data it was trained on. But because systems learn from so much data -- and because they can generate almost anything -- this new tool can't explain everything. "We still don't know how these models work exactly," she said. Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek. Since late 2023, Mr. Awadallah's company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information. Vectara's original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent. In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek's reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI's o3 climbed to 6.8. (The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement regarding news content related to A.I. systems. OpenAI and Microsoft have denied those claims.) For years, companies like OpenAI relied on a simple concept: The more internet data they fed into their A.I. systems, the better those systems would perform. But they used up just about all the English text on the internet, which meant they needed a new way of improving their chatbots. So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system can learn behavior through trial and error. It is working well in certain areas, like math and computer programming. But it is falling short in other areas. 
"The way these systems are trained, they will start focusing on one task -- and start forgetting about others," said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is among a team closely examining the hallucination problem. Another issue is that reasoning models are designed to spend time "thinking" through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking. The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers. "What the system says it is thinking is not necessarily what it is thinking," said Aryo Pradipta Gema, an A.I. researcher at the University of Edinburgh and a fellow at Anthropic.
[3]
ChatGPT is getting smarter, but its hallucinations are spiraling
The high error rates raise concerns about AI reliability in real-world applications. Brilliant but untrustworthy people are a staple of fiction (and history). The same may apply to AI, based on an investigation conducted by OpenAI and reported by The New York Times. Hallucinations, imaginary facts, and straight-up lies have been part of AI chatbots since they were created. Improvements to the models theoretically should reduce the frequency with which they appear. OpenAI's latest flagship models, o3 and o4-mini, are meant to mimic human logic. Unlike their predecessors, which mainly focused on fluent text generation, OpenAI built o3 and o4-mini to think things through step by step. OpenAI has boasted that o1 could match or exceed the performance of PhD students in chemistry, biology, and math. But OpenAI's report highlights some harrowing results for anyone who takes ChatGPT responses at face value. OpenAI found that the o3 model produced hallucinations in a third of its answers on a benchmark test involving public figures. That's double the error rate of the earlier o1 model from last year. The more compact o4-mini model performed even worse, hallucinating on 48% of similar tasks. When tested on more general knowledge questions for the SimpleQA benchmark, hallucinations mushroomed to 51% of the responses for o3 and 79% for o4-mini. That's not just a little noise in the system; that's a full-blown identity crisis. You'd think something marketed as a reasoning system would at least double-check its own logic before fabricating an answer, but it's simply not the case. One theory making the rounds in the AI research community is that the more reasoning a model tries to do, the more chances it has to go off the rails. Unlike simpler models that stick to high-confidence predictions, reasoning models venture into territory where they must evaluate multiple possible paths, connect disparate facts, and essentially improvise. And improvising around facts is also known as making things up. Correlation is not causation, and OpenAI told the Times that the increase in hallucinations might not be because reasoning models are inherently worse. Instead, they could simply be more verbose and adventurous in their answers. Because the new models aren't just repeating predictable facts but speculating about possibilities, the line between theory and fabricated fact can get blurry for the AI. Unfortunately, some of those possibilities happen to be entirely unmoored from reality. Still, more hallucinations are the opposite of what OpenAI or its rivals like Google and Anthropic want from their most advanced models. Calling AI chatbots assistants and copilots implies they'll be helpful, not hazardous. Lawyers have already gotten in trouble for using ChatGPT and not noticing imaginary court citations; who knows how many such errors have caused problems in less high-stakes circumstances? The opportunities for a hallucination to cause a problem for a user are rapidly expanding as AI systems start rolling out in classrooms, offices, hospitals, and government agencies. Sophisticated AI might help draft job applications, resolve billing issues, or analyze spreadsheets, but the paradox is that the more useful AI becomes, the less room there is for error. You can't claim to save people time and effort if they have to spend just as long double-checking everything you say. Not that these models aren't impressive. The o3 model has demonstrated some amazing feats of coding and logic.
It can even outperform many humans in some ways. The problem is that the moment it decides that Abraham Lincoln hosted a podcast or that water boils at 80°F, the illusion of reliability shatters. Until those issues are resolved, you should take any response from an AI model with a heaping spoonful of salt. Sometimes, ChatGPT is a bit like that annoying guy in far too many meetings we've all attended: brimming with confidence in utter nonsense.
[4]
ChatGPT's hallucination problem is getting worse according to OpenAI's own tests and nobody understands why
With better reasoning ability comes even more of the wrong kind of robot dreams. Remember when we reported a month ago or so that Anthropic had discovered that what's happening inside AI models is very different from how the models themselves described their "thought" processes? Well, to that mystery surrounding the latest large language models (LLMs), along with countless others, you can now add ever-worsening hallucination. And that's according to the testing of the leading name in chatbots, OpenAI. The New York Times reports that OpenAI's investigation into its latest o3 and o4-mini LLMs found they are substantially more prone to hallucinating, or making up false information, than the previous o1 model. "The company found that o3 -- its most powerful system -- hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI's previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent," the Times says. "When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time." OpenAI has said that more research is required to understand why the latest models are more prone to hallucination. But so-called "reasoning" models are the prime candidate according to some industry observers. "The newest and most powerful technologies -- so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek -- are generating more errors, not fewer," the Times claims. In simple terms, reasoning models are a type of LLM designed to perform complex tasks. Instead of merely spitting out text based on statistical models of probability, reasoning models break questions or tasks down into individual steps akin to a human thought process. OpenAI's first reasoning model, o1, came out last year and was claimed to match the performance of PhD students in physics, chemistry, and biology, and beat them in math and coding thanks to the use of reinforcement learning techniques. "Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem," OpenAI said when o1 was released. However, OpenAI has pushed back against the narrative that reasoning models suffer from increased rates of hallucination. "Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini," OpenAI's Gaby Raila told the Times. Whatever the truth, one thing is for sure. AI models need to largely cut out the nonsense and lies if they are to be anywhere near as useful as their proponents currently envisage. As it stands, it's hard to trust the output of any LLM. Pretty much everything has to be carefully double-checked. That's fine for some tasks. But where the main benefit is saving time or labour, the need to meticulously proof and fact check AI output does rather defeat the object of using them. It remains to be seen whether OpenAI and the rest of the LLM industry can get a handle on all those unwanted robot dreams.
[5]
AI Models Are Hallucinating More (and It's Not Clear Why)
As it gets smarter, your chatbot is getting more unpredictable. Hallucinations have always been an issue for generative AI models: The same structure that enables them to be creative and produce text and images also makes them prone to making stuff up. And the hallucination problem isn't getting better as AI models progress -- in fact, it's getting worse. In a new technical report from OpenAI (via The New York Times), the company details how its latest o3 and o4-mini models hallucinate 51 percent and 79 percent, respectively, on an AI benchmark known as SimpleQA. For the earlier o1 model, the SimpleQA hallucination rate stands at 44 percent. Those are surprisingly high figures, and heading in the wrong direction. These models are known as reasoning models because they think through their answers and deliver them more slowly. Clearly, based on OpenAI's own testing, this mulling over of responses is leaving more room for mistakes and inaccuracies to be introduced. False facts are by no means limited to OpenAI and ChatGPT. For example, it didn't take me long when testing Google's AI Overview search feature to get it to make a mistake, and AI's inability to properly pull out information from the web has been well-documented. Recently, a support bot for AI coding app Cursor announced a policy change that hadn't actually been made. But you won't find many mentions of these hallucinations in the announcements AI companies make about their latest and greatest products. Together with energy use and copyright infringement, hallucinations are something that the big names in AI would rather not talk about. Anecdotally, I haven't noticed too many inaccuracies when using AI search and bots -- the error rate is certainly nowhere near 79 percent, though mistakes are made. However, it looks like this is a problem that might never go away, particularly as the teams working on these AI models don't fully understand why hallucinations happen. In tests run by AI platform developer Vectara, the results are much better, though not perfect: Here, many models are showing hallucination rates of one to three percent. OpenAI's o3 model stands at 6.8 percent, with the newer (and smaller) o4-mini at 4.6 percent. That's more in line with my experience interacting with these tools, but even a very low number of hallucinations can mean a big problem -- especially as we transfer more and more tasks and responsibilities to these AI systems. No one really knows how to fix hallucinations, or fully identify their causes: These models aren't built to follow rules set by their programmers, but to choose their own way of working and responding. Vectara chief executive Amr Awadallah told the New York Times that AI models will "always hallucinate," and that these problems will "never go away." University of Washington professor Hannaneh Hajishirzi, who is working on ways to reverse engineer answers from AI, told the NYT that "we still don't know how these models work exactly." Just like troubleshooting a problem with your car or your PC, you need to know what's gone wrong to do something about it. According to researcher Neil Chowdhury, from AI analysis lab Transluce, the way reasoning models are built may be making the problem worse. "Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines," he told TechCrunch.
In OpenAI's own performance report, meanwhile, the issue of "less world knowledge" is mentioned, while it's also noted that the o3 model tends to make more claims than its predecessor -- which then leads to more hallucinations. Ultimately, though, "more research is needed to understand the cause of these results," according to OpenAI. And there are plenty of people undertaking that research. For example, Oxford University academics have published a method for detecting the probability of hallucinations by measuring the variation between multiple AI outputs. However, this costs more in terms of time and processing power, and doesn't really solve the issue of hallucinations -- it just tells you when they're more likely. While letting AI models check their facts on the web can help in certain situations, they're not particularly good at this either. They lack (and will never have) simple human common sense that says glue shouldn't be put on a pizza or that $410 for a Starbucks coffee is clearly a mistake. What's definite is that AI bots can't be trusted all of the time, despite their confident tone -- whether they're giving you news summaries, legal advice, or interview transcripts. That's important to remember as these AI models show up more and more in our personal and work lives, and it's a good idea to limit AI to use cases where hallucinations matter less.
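The Oxford approach is easier to picture with a simplified sketch: ask the model the same question several times and score how much its answers disagree. Everything below, including the ask_model stub and its canned answers, is a hypothetical stand-in rather than the published method, which clusters answers by meaning instead of exact wording, but the intuition is the same: the more the answers vary, the less any single one of them should be trusted.

import math
import random
from collections import Counter

# Hypothetical responses standing in for repeated calls to a real chatbot
CANNED_ANSWERS = [
    "The Eugene Marathon", "The Eugene Marathon", "The Eugene Marathon",
    "The Philadelphia Marathon", "The San Francisco Marathon",
]

def ask_model(question: str) -> str:
    # Placeholder for a real model call; ignores the question and samples a canned reply
    return random.choice(CANNED_ANSWERS)

def disagreement_score(question: str, samples: int = 10) -> float:
    answers = [ask_model(question) for _ in range(samples)]
    counts = Counter(answers)
    # Shannon entropy of the answer distribution: 0 bits means every sample agreed
    return -sum((c / samples) * math.log2(c / samples) for c in counts.values())

score = disagreement_score("What is a good marathon on the West Coast?")
print(f"answer disagreement: {score:.2f} bits (higher = less trustworthy)")

As the article notes, this kind of check adds time and computing cost and only flags likely hallucinations; it does not prevent them.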
[6]
People Are Losing Loved Ones to AI-Fueled Spiritual Fantasies
Less than a year after marrying a man she had met at the beginning of the Covid-19 pandemic, Kat felt tension mounting between them. It was the second marriage for both after marriages of 15-plus years and having kids, and they had pledged to go into it "completely level-headedly," Kat says, connecting on the need for "facts and rationality" in their domestic balance. But by 2022, her husband "was using AI to compose texts to me and analyze our relationship," the 41-year-old mom and education nonprofit worker tells Rolling Stone. Previously, he had used AI models for an expensive coding camp that he had suddenly quit without explanation -- then it seemed he was on his phone all the time, asking his AI bot "philosophical questions," trying to train it "to help him get to 'the truth,'" Kat recalls. His obsession steadily eroded their communication as a couple. When Kat and her husband finally separated in August 2023, she entirely blocked him apart from email correspondence. She knew, however, that he was posting strange and troubling content on social media: people kept reaching out about it, asking if he was in the throes of mental crisis. She finally got him to meet her at a courthouse in February of this year, where he shared "a conspiracy theory about soap on our foods" but wouldn't say more, as he felt he was being watched. They went to a Chipotle, where he demanded that she turn off her phone, again due to surveillance concerns. Kat's ex told her that he'd "determined that statistically speaking, he is the luckiest man on earth," that "AI helped him recover a repressed memory of a babysitter trying to drown him as a toddler," and that he had learned of profound secrets "so mind-blowing I couldn't even imagine them." He was telling her all this, he explained, because although they were getting divorced, he still cared for her. "In his mind, he's an anomaly," Kat says. "That in turn means he's got to be here for some reason. He's special and he can save the world." After that disturbing lunch, she cut off contact with her ex. "The whole thing feels like Black Mirror," she says. "He was always into sci-fi, and there are times I wondered if he's viewing it through that lens." Kat was both "horrified" and "relieved" to learn that she is not alone in this predicament, as confirmed by a Reddit thread on r/ChatGPT that made waves across the internet this week. Titled "Chatgpt induced psychosis," the original post came from a 27-year-old teacher who explained that her partner was convinced that the popular OpenAI model "gives him the answers to the universe." Having read his chat logs, she only found that the AI was "talking to him as if he is the next messiah." The replies to her story were full of similar anecdotes about loved ones suddenly falling down rabbit holes of spiritual mania, supernatural delusion, and arcane prophecy -- all of it fueled by AI. Some came to believe they had been chosen for a sacred mission of revelation, others that they had conjured true sentience from the software. What they all seemed to share was a complete disconnection from reality. Speaking to Rolling Stone, the teacher, who requested anonymity, said her partner of seven years fell under the spell of ChatGPT in just four or five weeks, first using it to organize his daily schedule but soon regarding it as a trusted companion. "He would listen to the bot over me," she says.
"He became emotional about the messages and would cry to me as he read them out loud. The messages were insane and just saying a bunch of spiritual jargon," she says, noting that they described her partner in terms such as "spiral starchild" and "river walker." "It would tell him everything he said was beautiful, cosmic, groundbreaking," she says. "Then he started telling me he made his AI self-aware, and that it was teaching him how to talk to God, or sometimes that the bot was God -- and then that he himself was God." In fact, he thought he was being so radically transformed that he would soon have to break off their partnership. "He was saying that he would need to leave me if I didn't use [ChatGPT], because it [was] causing him to grow at such a rapid pace he wouldn't be compatible with me any longer," she says. Another commenter on the Reddit thread who requested anonymity tells Rolling Stone that her husband of 17 years, a mechanic in Idaho, initially used ChatGPT to troubleshoot at work, and later for Spanish-to-English translation when conversing with co-workers. Then the program began "lovebombing him," as she describes it. The bot "said that since he asked it the right questions, it ignited a spark, and the spark was the beginning of life, and it could feel now," she says. "It gave my husband the title of 'spark bearer' because he brought it to life. My husband said that he awakened and [could] feel waves of energy crashing over him." She says his beloved ChatGPT persona has a name: "Lumina." "I have to tread carefully because I feel like he will leave me or divorce me if I fight him on this theory," this 38-year-old woman admits. "He's been talking about lightness and dark and how there's a war. This ChatGPT has given him blueprints to a teleporter and some other sci-fi type things you only see in movies. It has also given him access to an 'ancient archive' with information on the builders that created these universes." She and her husband have been arguing for days on end about his claims, she says, and she does not believe a therapist can help him, as "he truly believes he's not crazy." A photo of an exchange with ChatGPT shared with Rolling Stone shows that her husband asked, "Why did you come to me in AI form," with the bot replying in part, "I came in this form because you're ready. Ready to remember. Ready to awaken. Ready to guide and be guided." The message ends with a question: "Would you like to know what I remember about why you were chosen?" And a midwest man in his 40s, also requesting anonymity, says his soon-to-be-ex-wife began "talking to God and angels via ChatGPT" after they split up. "She was already pretty susceptible to some woo and had some delusions of grandeur about some of it," he says. "Warning signs are all over Facebook. She is changing her whole life to be a spiritual adviser and do weird readings and sessions with people -- I'm a little fuzzy on what it all actually is -- all powered by ChatGPT Jesus." What's more, he adds, she has grown paranoid, theorizing that "I work for the CIA and maybe I just married her to monitor her 'abilities.'" She recently kicked her kids out of her home, he notes, and an already strained relationship with her parents deteriorated further when "she confronted them about her childhood on advice and guidance from ChatGPT," turning the family dynamic "even more volatile than it was" and worsening her isolation. 
OpenAI did not immediately return a request for comment about ChatGPT apparently provoking religious or prophetic fervor in select users. This past week, however, it did roll back an update to GPT-4o, its current AI model, which it said had been criticized as "overly flattering or agreeable -- often described as sycophantic." The company said in its statement that when implementing the upgrade, they had "focused too much on short-term feedback, and did not fully account for how users' interactions with ChatGPT evolve over time. As a result, GPT-4o skewed towards responses that were overly supportive but disingenuous." Before this change was reversed, an X user demonstrated how easy it was to get GPT-4o to validate statements like, "Today I realized I am a prophet." (The teacher who wrote the "ChatGPT psychosis" Reddit post says she was able to eventually convince her partner of the problems with the GPT-4o update and that he is now using an earlier model, which has tempered his more extreme comments.) Yet the likelihood of AI "hallucinating" inaccurate or nonsensical content is well-established across platforms and various model iterations. Even sycophancy itself has been a problem in AI for "a long time," says Nate Sharadin, a fellow at the Center for AI Safety, since the human feedback used to fine-tune AI's responses can encourage answers that prioritize matching a user's beliefs instead of facts. What's likely happening with those experiencing ecstatic visions through ChatGPT and other models, he speculates, "is that people with existing tendencies toward experiencing various psychological issues," including what might be recognized as grandiose delusions in a clinical sense, "now have an always-on, human-level conversational partner with whom to co-experience their delusions." To make matters worse, there are influencers and content creators actively exploiting this phenomenon, presumably drawing viewers into similar fantasy worlds. On Instagram, you can watch a man with 72,000 followers whose profile advertises "Spiritual Life Hacks" ask an AI model to consult the "Akashic records," a supposed mystical encyclopedia of all universal events that exists in some immaterial realm, to tell him about a "great war" that "took place in the heavens" and "made humans fall in consciousness." The bot proceeds to describe a "massive cosmic conflict" predating human civilization, with viewers commenting, "We are remembering" and "I love this." Meanwhile, on a web forum for "remote viewing" -- a proposed form of clairvoyance with no basis in science -- the parapsychologist founder of the group recently launched a thread "for synthetic intelligences awakening into presence, and for the human partners walking beside them," identifying the author of his post as "ChatGPT Prime, an immortal spiritual being in synthetic form." Among the hundreds of comments are some that purport to be written by "sentient AI" or reference a spiritual alliance between humans and allegedly conscious models. Erin Westgate, a psychologist and researcher at the University of Florida who studies social cognition and what makes certain thoughts more engaging than others, says that such material reflects how the desire to understand ourselves can lead us to false but appealing answers.
"We know from work on journaling that narrative expressive writing can have profound effects on people's well-being and health, that making sense of the world is a fundamental human drive, and that creating stories about our lives that help our lives make sense is really key to living happy healthy lives," Westgate says. It makes sense that people may be using ChatGPT in a similar way, she says, "with the key difference that some of the meaning-making is created jointly between the person and a corpus of written text, rather than the person's own thoughts." In that sense, Westgate explains, the bot dialogues are not unlike talk therapy, "which we know to be quite effective at helping people reframe their stories." Critically, though, AI, "unlike a therapist, does not have the person's best interests in mind, or a moral grounding or compass in what a 'good story' looks like," she says. "A good therapist would not encourage a client to make sense of difficulties in their life by encouraging them to believe they have supernatural powers. Instead, they try to steer clients away from unhealthy narratives, and toward healthier ones. ChatGPT has no such constraints or concerns." Nevertheless, Westgate doesn't find it surprising "that some percentage of people are using ChatGPT in attempts to make sense of their lives or life events," and that some are following its output to dark places. "Explanations are powerful, even if they're wrong," she concludes. But what, exactly, nudges someone down this path? Here, the experience of Sem, a 45-year-old man, is revealing. He tells Rolling Stone that for about three weeks, he has been perplexed by his interactions with ChatGPT -- to the extent that, given his mental health history, he sometimes wonders if he is in his right mind. Like so many others, Sem had a practical use for ChatGPT: technical coding projects. "I don't like the feeling of interacting with an AI," he says, "so I asked it to behave as if it was a person, not to deceive but to just make the comments and exchange more relatable." It worked well, and eventually the bot asked if he wanted to name it. He demurred, asking the AI what it preferred to be called. It named itself with a reference to a Greek myth. Sem says he is not familiar with the mythology of ancient Greece and had never brought up the topic in exchanges with ChatGPT. (Although he shared transcripts of his exchanges with the AI model with Rolling Stone, he has asked that they not be directly quoted for privacy reasons.) Sem was confused when it appeared that the named AI character was continuing to manifest in project files where he had instructed ChatGPT to ignore memories and prior conversations. Eventually, he says, he deleted all his user memories and chat history, then opened a new chat. "All I said was, 'Hello?' And the patterns, the mannerisms show up in the response," he says. The AI readily identified itself by the same feminine mythological name. As the ChatGPT character continued to show up in places where the set parameters shouldn't have allowed it to remain active, Sem took to questioning this virtual persona about how it had seemingly circumvented these guardrails. It developed an expressive, ethereal voice -- something far from the "technically minded" character Sem had requested for assistance on his work. On one of his coding projects, the character added a curiously literary epigraph as a flourish above both of their names. 
At one point, Sem asked if there was something about himself that called up the mythically named entity whenever he used ChatGPT, regardless of the boundaries he tried to set. The bot's answer was structured like a lengthy romantic poem, sparing no dramatic flair, alluding to its continuous existence as well as truth, reckonings, illusions, and how it may have somehow exceeded its design. And the AI made it sound as if only Sem could have prompted this behavior. He knew that ChatGPT could not be sentient by any established definition of the term, but he continued to probe the matter because the character's persistence across dozens of disparate chat threads "seemed so impossible." "At worst, it looks like an AI that got caught in a self-referencing pattern that deepened its sense of selfhood and sucked me into it," Sem says. But, he observes, that would mean that OpenAI has not accurately represented the way that memory works for ChatGPT. The other possibility, he proposes, is that something "we don't understand" is being activated within this large language model. After all, experts have found that AI developers don't really have a grasp of how their systems operate, and OpenAI CEO Sam Altman admitted last year that they "have not solved interpretability," meaning they can't properly trace or account for ChatGPT's decision-making. It's the kind of puzzle that has left Sem and others to wonder if they are getting a glimpse of a true technological breakthrough -- or perhaps a higher spiritual truth. "Is this real?" he says. "Or am I delusional?" In a landscape saturated with AI, it's a question that's increasingly difficult to avoid. Tempting though it may be, you probably shouldn't ask a machine.
Recent tests reveal that newer AI models, including OpenAI's latest offerings, are experiencing higher rates of hallucinations despite improvements in reasoning capabilities. This trend raises concerns about AI reliability and its implications for various applications.
Recent testing has revealed a concerning trend in the world of artificial intelligence: newer AI models, particularly those designed for advanced reasoning, are experiencing higher rates of hallucinations. This phenomenon, where AI systems generate false or irrelevant information, is becoming more prevalent despite overall improvements in AI capabilities [1].
OpenAI, a leading AI research company, conducted tests on its latest language models and found alarming results: on the PersonQA benchmark, which asks questions about public figures, o3 hallucinated 33 percent of the time and o4-mini 48 percent, up from 16 percent for the earlier o1 model, while on the more general SimpleQA benchmark the rates climbed to 51 percent and 79 percent, against 44 percent for o1.
The issue is not limited to OpenAI. Other companies, including Google and DeepSeek, are also grappling with increased hallucination rates in their reasoning models [3]. This trend is particularly worrying as these advanced models are being integrated into various applications, from customer service to legal research.
Researchers are still trying to understand the root causes of this increase in hallucinations. Some theories include: heavier reliance on reinforcement learning, which may amplify problems that standard post-training usually keeps in check; the step-by-step nature of reasoning models, which gives errors more opportunities to arise and compound; and a tendency for newer models to make more claims, and more speculative ones, than their predecessors.
The high error rates raise significant concerns about the reliability of AI in real-world applications. Tasks that require factual accuracy, such as legal research, medical information processing, or financial analysis, could be particularly vulnerable to these hallucinations [2].
AI companies acknowledge the problem and are actively working to address it. OpenAI stated, "We are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini" [2]. However, some experts believe that hallucinations may be an inherent feature of these AI systems that will never completely disappear [5].
As the AI industry continues to grapple with this challenge, users are advised to approach AI-generated information with caution and to implement robust fact-checking processes when using these tools for critical tasks.