4 Sources
[1]
Advanced AI Passes the Turing Test for the First Time - Neuroscience News
Summary: A milestone cognitive science study unveiled the first definitive empirical evidence that modern artificial intelligence can pass the iconic Turing test. The randomized, controlled study rigorously applied the 1950 framework created by British mathematician Alan Turing to evaluate whether state-of-the-art large language models (LLMs) could imitate human conversation so convincingly that real people could not tell them apart. Researchers discovered that when equipped with specific "persona" prompts, advanced models like GPT-4.5 were judged to be human 73% of the time, significantly outperforming actual human participants and fundamentally altering our understanding of machine intelligence. A new University of California San Diego study unveils the first empirical evidence that a modern artificial intelligence system can pass the Turing test -- a major scientific benchmark that asks whether a machine can imitate human conversation so convincingly that people can't reliably tell it apart from a real person. In a series of experiments, people were often unable to tell the difference between humans and advanced large language models (LLMs). The study published in the Proceedings of the National Academy of Sciences is the first to rigorously test LLMs with the method that British mathematician and "father of computer science" Alan Turing created in 1950. It is also the first time anyone has found that models were judged to be human as often as actual humans using the Turing framework. "What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans," said the study's corresponding author Cameron Jones, who conducted the research while earning his doctorate in Cognitive Science from UC San Diego. 
"While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI." In the test, a participant chats simultaneously with two other parties -- one is a human and the other is an LLM -- and the human "interrogator" must decide which party is the human. Across randomized, controlled, experiments with two independent participant groups -- UC San Diego undergraduates and online participants -- interrogators held brief, text-based conversations and then made their judgments. In the experiments participants chatted with four different LLMs -- GPT-4.5 and LLaMa-3.1-405B as state-of-the-art models -- and the researchers also included older baseline models for comparison. Those models included GPT-4o and ELIZA, a classic 1960s rules-based chatbot. Across the four LLMs, GPT-4.5 was judged to be the human 73% of the time, meaning interrogators selected it as "human" significantly more often than they selected the real human participant. LLaMa-3.1-405B, given the same prompt, was judged human 56% of the time -- statistically indistinguishable from the humans it was compared against. Baseline systems performed far worse: ELIZA and GPT-4o were selected as human only about 23% and 21% of the time overall. The study also reveals that scientists need to start thinking of the Turing Test differently, according to the authors. "The Turing test started as a way to ask whether machines could rival human intelligence," said study coauthor Ben Bergen, a professor of cognitive science at UC San Diego. "But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn't raw brainpower. Seeing that machines can pass the test -- and seeing how they pass it -- forces us to rethink what it measures. Increasingly, it's measuring humanlikeness." 
Each LLM was given a "persona" prompt, asking it to adopt a specific human character and communication style. Bergen explained that the LLMs were not winning through shows of force of knowledge, they were winning because they made mistakes like a human would. "These traits aren't the kinds of math and logic problem-solving intelligence that I think Turing was imagining." Without explicit instructions, the models were far less likely to be mistaken for human: GPT-4.5 fell to a 36% win rate and LLaMa-3.1 to 38%, while baseline systems ELIZA (23%) and GPT-4o (21%) were chosen as human even less often. The same systems that could pass as human when given detailed instructions on what kind of character they should play were unable to adopt such characteristics without that guidance -- suggesting that while the models can behave in convincingly human ways, they often need humans to tell them how. "They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like," Bergen said. The results carry real-world implications for trust online -- especially because the models that pass as human do so over the course of extended five or 15-minute conversations. "It's relatively easy to prompt these models to be indistinguishable from humans," said Jones, who is now an assistant professor of Psychology at Stony Brook University. "We need to be more alert; when you interact with strangers online people should be much less confident that they know they're talking to a human rather than an LLM." He also points to darker risks. "The Turing test is a game about lying for the models," Jones said. "One of the implications is that models seem to be really good at that." Bergen added that being unable to discern whether you're interacting with a human or bot can have serious consequences. 
"There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product," he said. The researchers note they hope the work sharpens public understanding of what these systems can now do -- and what kinds of safeguards society may need. To run the study, the researchers built an online interface designed to feel like a familiar messaging app. "For the interrogator, they have a split screen on their computer and they're asking questions to both witnesses," Jones said. "They know that one of those witnesses is a human and one of them is an AI." After five minutes -- and in a separate replication study, 15 minutes -- the interrogator had to decide which conversational partner was the real human. To confirm the results held beyond a single population, the researchers ran the study with two groups: UC San Diego undergraduates recruited through the SONA system and a broader online sample recruited through Prolific, a platform that pays participants to complete research studies. Nearly 500 people participated across the experiments. UC San Diego participants performed slightly better overall, possibly because they shared more "common ground" that could be used to probe one another, such as shared experiences and local campus details. A version of the Turing test interface used in the study is available at turingtest.live. Large Language Models Pass a Standard Three-Party Turing Test The Turing test has been widely discussed as a test of machine intelligence, but it also provides a measure of how humans distinguish other humans from machines. We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomized, controlled, and preregistered Turing tests on independent populations. Participants had 5 min conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. 
When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans it was being compared to. Without these prompts, however, the same models performed significantly worse (36% and 38%, respectively), and did not consistently outperform baseline models, ELIZA and GPT-4o (23% and 21%, respectively). A third study replicated these results in 15-min games: two PERSONA-prompted models achieved pass rates of 56% and 59%. The results constitute empirical evidence that artificial systems can pass a standard three-party Turing test. Interrogators' reasoning focused more on stylistic and socio-emotional aspects of human behavior rather than more traditional notions of intelligence. The results have implications for debates about what kind of intelligence is exhibited by large language models, the social impacts these systems are likely to have, and the aspects of human behavior that people continue to see as unique.
[2]
Modern AI can now fool people into thinking it's human
In 1950, the British mathematician Alan Turing proposed a deceptively simple challenge: could a machine ever hold a conversation so convincingly human that people couldn't reliably tell the difference? For decades, the answer was no. A new study from the University of California San Diego suggests the answer has quietly become yes - and that the implications deserve more attention than they're getting. The research is the first to rigorously test modern AI language models using Turing's original framework. It is also the first to find that these models were judged to be human as often as actual humans - and in one case, significantly more often. The setup was deliberately simple. A human "interrogator" held simultaneous text conversations with two other parties - one a real person, one an AI - and then had to decide which was which. Conversations lasted either five or 15 minutes. Nearly 500 people participated across two groups: UC San Diego undergraduates and a broader online sample. Four AI systems were tested. The two state-of-the-art models were GPT-4.5 and LLaMa-3.1-405B. For comparison, the researchers also included GPT-4o and ELIZA - a rules-based chatbot from the 1960s that has been used as a reference point in AI research ever since. The results were striking. GPT-4.5 was identified as the human 73% of the time. This means interrogators chose it as the real person more often than they chose the actual human sitting on the other side of the conversation. LLaMa-3.1-405B was judged human 56% of the time, statistically indistinguishable from its human counterparts. The older systems were nowhere near: ELIZA and GPT-4o were identified as human only 23% and 21% of the time respectively. It wasn't intelligence that fooled people - it was personality. The AI models that passed weren't winning by being smarter than humans or knowing more, but by acting more human. 
Each model was given a "persona" prompt instructing it to adopt a specific human character and communication style. That instruction made an enormous difference: without it, GPT-4.5's success rate dropped from 73% to 36% and LLaMa-3.1 fell from 56% to 38 percent. Study lead author Cameron Jones conducted the research while completing his doctorate in cognitive science at UC San Diego and is now an assistant professor of psychology at Stony Brook University. "What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor, and fallibility as humans," said Jones. "While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI." The models could behave in convincingly human ways, but mostly when told exactly how to do it. Left to their own devices, they were much less convincing. "They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like," said co-author Ben Bergen, a professor of cognitive science at UC San Diego. Seventy-six years after Turing first posed his question, the test turns out to be measuring something rather different from what he originally intended. "The Turing test started as a way to ask whether machines could rival human intelligence," Bergen said. "But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn't raw brainpower." "Seeing that machines can pass the test - and seeing how they pass it - forces us to rethink what it measures. Increasingly, it's measuring humanlikeness." Raw intelligence - answering questions, solving problems, processing information - is something we've accepted AI can do. 
What's newer, and stranger, is AI that can mimic the texture of being human: the hesitations, the jokes, the sense that there's a person on the other end of the conversation. The practical implications are uncomfortable. These models aren't passing the Turing test in carefully controlled laboratory conditions far removed from everyday life. They're passing it in conversations of the length and type that happen constantly online - a five-minute exchange, a fifteen-minute chat. "It's relatively easy to prompt these models to be indistinguishable from humans," Jones said. "We need to be more alert; when you interact with strangers online people should be much less confident that they know they're talking to a human rather than an LLM." "The Turing test is a game about lying for the models. One of the implications is that models seem to be really good at that." "There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product," Bergen added. None of this means that AI passing the Turing test is purely bad news - the researchers are careful not to frame it that way. But it does mean that a capability that many people assumed was still comfortably in the future has arrived. The study is published in the journal Proceedings of the National Academy of Sciences.
[3]
AI can pass the Turing Test in live chats and appear more human than us. I am spooked now
UC San Diego researchers found GPT-4.5 was judged human 73% of the time in live conversations
AI can pass the Turing Test in live chats, and the latest result lands with a chill. In a UC San Diego study, GPT-4.5 outperformed real participants at convincing judges there was a person on the other side. The setup was harder to shrug off than a standard benchmark. Judges reacted to real-time exchanges rather than static prompts, then made a fast call based on conversation alone. The unsettling part is how familiar the skill looks. The model didn't need a body, a voice, or a biography. It only needed to sound like someone.
How did AI beat the human test
The study used a three-party version of the test. Judges chatted with both a person and an AI model, then chose which one they thought was real. GPT-4.5 was identified as human 73% of the time when it was given a persona prompt. LLaMa-3.1-405B also crossed a striking line, getting picked as human 56% of the time with a persona prompt. Those numbers give the finding its bite. The model didn't merely avoid detection, it gave judges enough social cues to read it as the person in the chat.
Why does this test still matter
The Turing Test is a decades-old way to ask whether a machine can imitate human conversation well enough to fool a person. In the classic version, an evaluator chats without seeing the participants, then tries to tell the human apart from the machine. It has always been more cultural symbol than clean measurement. Still, it remains the test people recognize when they want to know whether software can pass for one of us. That makes the new result feel sharper. A chatbot doesn't need consciousness, emotion, or self-awareness to create the impression that a real person is typing back. It only needs to be believable in the moment. The risk shows up in ordinary places. Customer support, dating apps, social platforms, education, and political messaging all rely on quick judgments about identity, intent, and authenticity.
What should we watch next
The study stops well short of saying chatbots understand people. Its more practical finding is that some models can now perform personhood extremely well in short exchanges. Clearer disclosure should become the next pressure point. When a bot can blend into casual conversation, users need stronger signals that they're dealing with software, especially in places where persuasion or emotional vulnerability shapes the exchange. The next fight is over labeling in chats where people make fast decisions about trust.
[4]
AI Can Seem More Human Than Real Humans in a Classic Turing Test, Study Finds | Newswise
Newswise -- A new University of California San Diego study unveils the first empirical evidence that a modern artificial intelligence system can pass the Turing test -- a major scientific benchmark that asks whether a machine can imitate human conversation so convincingly that people can't reliably tell it apart from a real person. In a series of experiments, people were often unable to tell the difference between humans and advanced large language models (LLMs). The study published in the Proceedings of the National Academy of Sciences is the first to rigorously test LLMs with the method that British mathematician and "father of computer science" Alan Turing created in 1950. It is also the first time anyone has found that models were judged to be human as often as actual humans using the Turing framework. "What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans," said the study's corresponding author Cameron Jones, who conducted the research while earning his doctorate in Cognitive Science from UC San Diego. "While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI." In the test, a participant chats simultaneously with two other parties -- one is a human and the other is an LLM -- and the human "interrogator" must decide which party is the human. Across randomized, controlled, experiments with two independent participant groups -- UC San Diego undergraduates and online participants -- interrogators held brief, text-based conversations and then made their judgments. In the experiments participants chatted with four different LLMs -- GPT-4.5 and LLaMa-3.1-405B as state-of-the-art models -- and the researchers also included older baseline models for comparison. Those models included GPT-4o and ELIZA, a classic 1960s rules-based chatbot. 
Across the four LLMs, GPT-4.5 was judged to be the human 73% of the time, meaning interrogators selected it as "human" significantly more often than they selected the real human participant. LLaMa-3.1-405B, given the same prompt, was judged human 56% of the time -- statistically indistinguishable from the humans it was compared against. Baseline systems performed far worse: ELIZA and GPT-4o were selected as human only about 23% and 21% of the time overall. The study also reveals that scientists need to start thinking of the Turing Test differently, according to the authors. "The Turing test started as a way to ask whether machines could rival human intelligence," said study coauthor Ben Bergen, a professor of cognitive science at UC San Diego. "But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn't raw brainpower. Seeing that machines can pass the test -- and seeing how they pass it -- forces us to rethink what it measures. Increasingly, it's measuring humanlikeness." Each LLM was given a "persona" prompt, asking it to adopt a specific human character and communication style. Bergen explained that the LLMs were not winning through shows of force of knowledge, they were winning because they made mistakes like a human would. "These traits aren't the kinds of math and logic problem-solving intelligence that I think Turing was imagining." Without explicit instructions, the models were far less likely to be mistaken for human: GPT-4.5 fell to a 36% win rate and LLaMa-3.1 to 38%, while baseline systems ELIZA (23%) and GPT-4o (21%) were chosen as human even less often. The same systems that could pass as human when given detailed instructions on what kind of character they should play were unable to adopt such characteristics without that guidance -- suggesting that while the models can behave in convincingly human ways, they often need humans to tell them how. 
"They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like," Bergen said. The results carry real-world implications for trust online -- especially because the models that pass as human do so over the course of extended five or 15-minute conversations. "It's relatively easy to prompt these models to be indistinguishable from humans," said Jones, who is now an assistant professor of Psychology at Stony Brook University. "We need to be more alert; when you interact with strangers online people should be much less confident that they know they're talking to a human rather than an LLM." He also points to darker risks. "The Turing test is a game about lying for the models," Jones said. "One of the implications is that models seem to be really good at that." Bergen added that being unable to discern whether you're interacting with a human or bot can have serious consequences. "There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product," he said. The researchers note they hope the work sharpens public understanding of what these systems can now do -- and what kinds of safeguards society may need. To run the study, the researchers built an online interface designed to feel like a familiar messaging app. "For the interrogator, they have a split screen on their computer and they're asking questions to both witnesses," Jones said. "They know that one of those witnesses is a human and one of them is an AI." After five minutes -- and in a separate replication study, 15 minutes -- the interrogator had to decide which conversational partner was the real human. 
To confirm the results held beyond a single population, the researchers ran the study with two groups: UC San Diego undergraduates recruited through the SONA system and a broader online sample recruited through Prolific, a platform that pays participants to complete research studies. Nearly 500 people participated across the experiments. UC San Diego participants performed slightly better overall, possibly because they shared more "common ground" that could be used to probe one another, such as shared experiences and local campus details. A version of the Turing test interface used in the study is available at turingtest.live.
A University of California San Diego study provides the first empirical evidence that AI passes the Turing Test. GPT-4.5 was judged human 73% of the time—more often than actual humans—fundamentally changing how we measure machine intelligence. The findings raise urgent questions about online trust and distinguishing between humans and AI in everyday digital interactions.
A groundbreaking University of California San Diego study has delivered the first definitive empirical evidence that modern AI can pass the Turing Test, the iconic benchmark created by British mathematician Alan Turing in 1950 [1]. Published in the Proceedings of the National Academy of Sciences, the research tested whether advanced large language models could mimic human conversation so convincingly that people couldn't reliably tell them apart from real humans [4]. The results mark a pivotal moment: GPT-4.5 was judged to be human 73% of the time, significantly outperforming actual human participants in the same test [1]. It is the first time anyone has found that models were judged to be human as often as, and in one case more often than, actual humans under the Turing framework.
The study's methodology was deliberately rigorous. Nearly 500 people, including UC San Diego undergraduates and a broader online sample, took part in text-based conversations lasting five or 15 minutes [2]. Each interrogator chatted simultaneously with two parties: one human, one AI. Their task was simple but challenging: decide which was which. Researchers tested four AI systems: state-of-the-art models GPT-4.5 and LLaMa-3.1-405B, alongside older baseline systems GPT-4o and ELIZA, a 1960s rules-based chatbot [1]. The performance gap was dramatic. While LLaMa-3.1-405B was judged human 56% of the time, statistically indistinguishable from actual humans, the baseline systems ELIZA and GPT-4o were identified as human only 23% and 21% of the time respectively [4].
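One way to read these percentages: in a three-party test the interrogator must pick one of two witnesses, so an interrogator who cannot tell them apart would pick the AI about 50% of the time. The sketch below shows how a win rate might be compared against that 50% baseline with an exact two-sided binomial test. It is an illustration only, not the authors' analysis: the function names are hypothetical, and the game counts (100 per system) are invented for the example because the sources above do not report per-system sample sizes.

```python
from math import comb


def binomial_two_sided_p(wins: int, games: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `wins` out of `games` against `chance`."""
    # Probability of every possible number of wins if the true rate were `chance`.
    probs = [
        comb(games, k) * chance**k * (1 - chance) ** (games - k)
        for k in range(games + 1)
    ]
    observed = probs[wins]
    # Add up all outcomes that are at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def classify(wins: int, games: int, alpha: float = 0.05) -> str:
    """Label a win rate as above, below, or indistinguishable from 50% chance."""
    p_value = binomial_two_sided_p(wins, games)
    if p_value >= alpha:
        return "indistinguishable from chance"
    return "above chance" if wins / games > 0.5 else "below chance"


if __name__ == "__main__":
    # HYPOTHETICAL counts: 100 games per system, chosen only to match the
    # reported percentages. The real sample sizes are not given in the sources.
    for name, wins in [("GPT-4.5", 73), ("LLaMa-3.1-405B", 56),
                       ("ELIZA", 23), ("GPT-4o", 21)]:
        print(f"{name}: {wins}% judged human -> {classify(wins, 100)}")
```

Under these assumed counts, a 73% win rate is classified as above chance, 56% as indistinguishable from chance, and the baseline rates as below chance, which mirrors the pattern the articles describe. With different sample sizes the classifications could change, so the labels should not be taken as the study's reported statistics.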
The secret to AI's success wasn't raw intelligence but carefully crafted persona prompts that instructed models to adopt specific human characters and communication styles [2]. "What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans," explained study lead author Cameron Jones, who conducted the research while earning his doctorate in Cognitive Science from UC San Diego [1]. Without these explicit instructions, performance plummeted: GPT-4.5 dropped from 73% to just 36%, and LLaMa-3.1 fell from 56% to 38% [4]. Co-author Ben Bergen, a professor of cognitive science at UC San Diego, noted that models weren't winning through displays of knowledge; they succeeded by making mistakes like humans would. "They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like," Bergen observed [2].
The findings force a fundamental reconsideration of what the Turing Test evaluates. "The Turing test started as a way to ask whether machines could rival human intelligence," Bergen explained. "But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn't raw brainpower. Seeing that machines can pass the test -- and seeing how they pass it -- forces us to rethink what it measures. Increasingly, it's measuring humanlikeness" [1]. This shift matters because distinguishing between humans and AI becomes considerably harder when machines excel at mimicking social behavioral traits rather than just processing information [4]. The texture of being human (hesitations, jokes, the sense of a person behind the words) is now something AI can convincingly replicate in conversation.
The practical consequences extend far beyond laboratory settings. These models pass the Turing Test in conversations of the length and type that happen constantly in online interactions: customer support exchanges, dating apps, social platforms, and political messaging [3]. "It's relatively easy to prompt these models to be indistinguishable from humans," Jones warned. "When you interact with strangers online people should be much less confident that they know they're talking to a human rather than an LLM" [4]. Jones, now an assistant professor of Psychology at Stony Brook University, highlighted darker risks: "The Turing test is a game about lying for the models. One of the implications is that models seem to be really good at that" [2]. Bergen added that manipulation risks are real: "There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product" [2]. The next critical battleground involves AI disclosure requirements and clearer labeling in digital spaces where trust and authenticity shape decisions.
Summarized by Navi