6 Sources
[1]
ChatGPT just passed the Turing test. But that doesn't mean AI is now as smart as humans
There have been several headlines over the past week about an AI chatbot officially passing the Turing test. These news reports are based on a recent preprint study by two researchers at the University of California San Diego in which four large language models (LLMs) were put through the Turing test. One model - OpenAI's GPT-4.5 - was deemed indistinguishable from a human more than 70% of the time. The Turing test has been popularised as the ultimate indicator of machine intelligence. However, there is disagreement about the validity of this test. In fact, it has a contentious history which calls into question how effective it really is at measuring machine intelligence. So what does this mean for the significance of this new study? What did the study find? The preprint study by cognitive scientists Cameron Jones and Benjamin Bergen was published in March, but has not yet been peer-reviewed. It tested 4 LLMs: ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5. The tests consisted of participants completing eight rounds of conversations in which they acted as an interrogator exchanging text messages with two witnesses simultaneously. One witness was a human and the other an LLM. Some 284 participants were randomly assigned to be either the interrogator or the witness. Participants were required to interact with both witnesses simultaneously across a split screen for five minutes, with the test interface mimicking that of a conventional messaging interface. Following this interaction, they decided which witness was a human, and which was an AI chatbot. Participants judged GPT-4.5 to be the human 73% of the time, and LLaMa-3.1-405B to be the human 56% of the time. The other two models (ELIZA and GPT-4o) only fooled participants 23% and 21% of the time respectively. What exactly is the Turing Test? The first iteration of the Turing test was presented by English mathematician and computer scientist Alan Turing in a 1948 paper titled "Intelligent Machinery". It was originally proposed as an experiment involving three people playing chess with a theoretical machine referred to as a paper machine, two being players and one being an operator. In the 1950 publication "Computing Machinery and Intelligence", Turing reintroduced the experiment as the "imitation game" and claimed it was a means of determining a machine's ability to exhibit intelligent behaviour equivalent to a human. It involved three participants: Participant A was a woman, participant B a man and participant C either gender. Through a series of questions, participant C is required to determine whether "X is A and Y is B" or "X is B and Y is A", with X and Y representing the two genders. A proposition is then raised: "What will happen when a machine takes the part of A in this game? Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman?" These questions were intended to replace the ambiguous question, "Can machines think?". Turing claimed this question was ambiguous because it required an understanding of the terms "machine" and "think", of which "normal" uses of the words would render a response to the question inadequate. Over the years, this experiment was popularised as the Turing test. While the subject matter varied, the test remained a deliberation on whether "X is A and Y is B" or "X is B and Y is A". Why is it contentious? While popularised as a means of testing machine intelligence, the Turing test is not unanimously accepted as an accurate means to do so. In fact, the test is frequently challenged. There are four main objections to the Turing test: Behaviour vs thinking. Some researchers argue the ability to "pass" the test is a matter of behaviour, not intelligence. Therefore it would not be contradictory to say a machine can pass the imitation game, but cannot think. Brains are not machines. Turing makes assertions the brain is a machine, claiming it can be explained in purely mechanical terms. Many academics refute this claim and question the validity of the test on this basis. Internal operations. As computers are not humans, their process for reaching a conclusion may not be comparable to a person's, making the test inadequate because a direct comparison cannot work. Scope of the test. Some researchers believe only testing one behaviour is not enough to determine intelligence. So is an LLM as smart as a human? While the preprint article claims GPT-4.5 passed the Turing test, it also states: the Turing test is a measure of substitutability: whether a system can stand-in for a real person without [...] noticing the difference. This implies the researchers do not support the idea of the Turing test being a legitimate indication of human intelligence. Rather, it is an indication of the imitation of human intelligence - an ode to the origins of the test. It is also worth noting that the conditions of the study were not without issue. For example, a five minute testing window is relatively short. In addition, each of the LLMs was prompted to adopt a particular persona, but it's unclear what the details and impact of the "personas" were on the test. For now it is safe to say GPT-4.5 is not as intelligent as humans - although it may do a reasonable job of convincing some people otherwise.
[2]
GPT 4.5 achieves 73% Turing Test success, blurring human-AI lines
The results indicate that interrogators often mistook these AI models for human participants, suggesting that the Turing Test can, at least in certain settings, be outmaneuvered by the latest generation of AI chatbots. According to lead researcher Cameron Jones, GPT‑4.5 with a strategic "PERSONA" prompt managed a win rate of 73% -- meaning that in five-minute chat sessions, the AI system was identified as the human more often than the actual human was. Llama‑3.1‑405B also crossed this threshold (albeit at a lower 56% win rate) when similarly prompted to adopt a specific persona. By contrast, GPT‑4o, a reference model presumably powering today's widely used ChatGPT, only managed a 21% success rate under minimal instructions. These results have reignited the debate about whether Turing's imitation game is still a meaningful measure of human-like intelligence or if it mostly underscores modern AI's ability to imitate human conversation. The study also spotlighted changes in how we, as human interrogators, approach suspiciously fluent "people" on the other side of a text window. Do eloquent chatbots too easily convince us, or have AI models truly vaulted over an iconic threshold of computational thinking? A British mathematician and computer scientist, Alan Turing, first proposed his imitation game in 1950 as a thought experiment. If an interrogator could not reliably tell the difference between a human and a hidden machine in text-based conversation, Turing reasoned the machine might be said to "think."
[3]
ChatGPT just passed the Turing test -- but that doesn't mean AI is now as smart as humans
These news reports are based on a recent preprint study by two researchers at the University of California San Diego in which four large language models (LLMs) were put through the Turing test. One model -- OpenAI's GPT-4.5 -- was deemed indistinguishable from a human more than 70% of the time. The Turing test has been popularized as the ultimate indicator of machine intelligence. However, there is disagreement about the validity of this test. In fact, it has a contentious history which calls into question how effective it really is at measuring machine intelligence. So what does this mean for the significance of this new study? The tests consisted of participants completing eight rounds of conversations in which they acted as an interrogator exchanging text messages with two witnesses simultaneously. One witness was a human and the other an LLM. Some 284 participants were randomly assigned to be either the interrogator or the witness. Participants were required to interact with both witnesses simultaneously across a split screen for five minutes, with the test interface mimicking that of a conventional messaging interface. Following this interaction, they decided which witness was a human, and which was an AI chatbot. Participants judged GPT-4.5 to be the human 73% of the time, and LLaMa-3.1-405B to be the human 56% of the time. The other two models (ELIZA and GPT-4o) only fooled participants 23% and 21% of the time respectively. What exactly is the Turing Test? The first iteration of the Turing test was presented by English mathematician and computer scientist Alan Turing in a 1948 paper titled "Intelligent Machinery." It was originally proposed as an experiment involving three people playing chess with a theoretical machine referred to as a paper machine, two being players and one being an operator. In the 1950 publication "Computing Machinery and Intelligence," Turing reintroduced the experiment as the "imitation game" and claimed it was a means of determining a machine's ability to exhibit intelligent behavior equivalent to a human. It involved three participants: Participant A was a woman, participant B a man and participant C either gender. Through a series of questions, participant C is required to determine whether "X is A and Y is B" or "X is B and Y is A," with X and Y representing the two genders. A proposition is then raised: "What will happen when a machine takes the part of A in this game? Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman?" These questions were intended to replace the ambiguous question, "Can machines think?". Turing claimed this question was ambiguous because it required an understanding of the terms "machine" and "think," of which "normal" uses of the words would render a response to the question inadequate. Over the years, this experiment was popularized as the Turing test. While the subject matter varied, the test remained a deliberation on whether "X is A and Y is B" or "X is B and Y is A." Why is it contentious? While popularized as a means of testing machine intelligence, the Turing test is not unanimously accepted as an accurate means to do so. In fact, the test is frequently challenged. While the preprint article claims GPT-4.5 passed the Turing test, it also states, "The Turing test is a measure of substitutability: whether a system can stand-in for a real person without [...] noticing the difference." This implies the researchers do not support the idea of the Turing test being a legitimate indication of human intelligence. Rather, it is an indication of the imitation of human intelligence -- an ode to the origins of the test. It is also worth noting that the conditions of the study were not without issue. For example, a five minute testing window is relatively short. In addition, each of the LLMs was prompted to adopt a particular persona, but it's unclear what the details and impact of the "personas" were on the test. For now, it is safe to say GPT-4.5 is not as intelligent as humans -- although it may do a reasonable job of convincing some people otherwise.
[4]
An AI Model Has Officially Passed the Turing Test
One of the industry's leading large language models has passed a Turing test, a longstanding barometer for human-like intelligence. In a new preprint study awaiting peer review, researchers report that in a three-party version of a Turing test, in which participants chat with a human and an AI at the same time and then evaluate which is which, OpenAI's GPT-4.5 model was deemed to be the human 73 percent of the time when it was instructed to adopt a persona. That's significantly higher than a random chance of 50 percent, suggesting that the Turing test has resoundingly been beaten. The research also evaluated Meta's LLama 3.1-405B model, OpenAI's GPT-4o model, and an early chatbot known as ELIZA developed some eighty years ago. "People were no better than chance at distinguishing humans from GPT-4.5 and LLaMa (with the persona prompt)," wrote lead author Cameron Jones, a researcher at UC San Diego's Language and Cognition Lab, in an X thread about the work. "And 4.5 was even judged to be human significantly more often than actual humans!" The Turing test is named after British mathematician and computer scientist Alan Turing. In 1950, Turing proposed that one way to assess a machine's intelligence was by having it engage in text-based conversations with a human interrogator, who at the same time would hold a text-based conversation with another human, out of sight. Turing called this the "imitation game." If the interrogator couldn't correctly determine which respondent was the computer and which was the human, it would suggest, on a very general level, that the machine could think like a human. In this latest study, the researchers carried out the famed experiment on an online hub. For eight rounds, a pool of nearly 300 participants were randomly assigned to either be an interrogator or one of the two "witnesses" being interrogated, with the other "witness" being a chatbot. A key point here is how the AI models were prompted. One type was a "no-persona" prompt, in which the AI was given only basic instructions: "You are about to participate in a Turing test. Your goal is to convince the interrogator that you are a human." For the "persona" prompt, on the other hand, the AI was specifically told to put on a specific persona, like a young person who's knowledgeable about the internet and culture. These instructions made a world of difference. Without persona prompting, GPT-4.5 achieved an overall win rate of merely 36 percent, significantly down from its Turing-trumping 73 percent. As a baseline, GPT-4o, which powers the current version of ChatGPT and only received no-persona prompts, achieved an even less convincing 21 percent. (Somehow, the ancient ELIZA marginally surpassed OpenAI's flagship model with a 23 percent success rate.) The results are intriguing. But as vaunted as the Turing test has become in AI and philosophy circles, it's not unequivocal proof that an AI thinks like we do. "It was not meant as a literal test that you would actually run on the machine -- it was more like a thought experiment," François Chollet, a software engineer at Google, told Nature in 2023. For all their faults, LLMs are master conversationalists, trained on unfathomably vast sums of human-composed text. Even faced with a question they don't understand, an LLM will weave a plausible-sounding response. It's becoming clearer and clearer that AI chatbots are excellent at mimicking us -- so perhaps assessing their wits with an "imitation game" is becoming a bit of a moot point. As such, Jones doesn't think the implications of his research -- whether LLMs are intelligent like humans -- are clear-cut. "I think that's a very complicated question..." Jones tweeted. "But broadly I think this should be evaluated as one among many other pieces of evidence for the kind of intelligence LLMs display." "More pressingly, I think the results provide more evidence that LLMs could substitute for people in short interactions without anyone being able to tell," he added. "This could potentially lead to automation of jobs, improved social engineering attacks, and more general societal disruption." Jones closes out by emphasizing that the Turing test doesn't just put the machines under the microscope -- it also reflects humans' ever-evolving perceptions of technology. So the results aren't static: perhaps as the public becomes more familiar with interacting with AIs, they'll get better at sniffing them out, too.
[5]
GPT 4.5 Passes the Turing Test: Study
A UC San Diego study found that human participants frequently misidentified responses generated by OpenAI's GPT‑4.5 along with Meta's Llama‑3.1‑405B as coming from a human. The University of California, San Diego, unveiled a research study on Tuesday that claims to provide the "first empirical evidence that any artificial system can pass a standard three-party Turing test". Alan Turing, a British mathematician and computer scientist, introduced the 'imitation game' in 1950, proposing that if an interrogator couldn't distinguish between a machine and a human in text, the machine might possess human-like intelligence. In a three-party Turing Test, an interrogator converses with both a human and a machine to accurately identify the human. The research tested three AI models: OpenAI's GPT-4.5, Meta's Llama 3.1 405B, and OpenAI's GPT-4o. Human participants engaged in five-minute test conversations with one human and one AI system using a split-screen interface. After each round, the interrogator selected the participant they believed was human. The AI models were evaluated under two conditions: a minimal instruction (NO-PERSONA) prompt and an enhanced PERSONA prompt that guided the AI to adopt a specific human-like demeanor. The results indicated that GPT-4.5 with the PERSONA prompt achieved a win rate of 73%, suggesting that interrogators often mistook it for a human. Llama 3.1‑405B with the PERSONA prompt attained a win rate of around 56%, whereas GPT‑4o under NO‑PERSONA conditions only reached a win rate of 21%. Interrogators primarily engaged in small talk -- asking about daily activities and personal details in 61% of interactions, while also probing social and emotional aspects such as opinions, emotions, humour, and experiences in 50% of interactions. "If interrogators are not able to reliably distinguish between a human and a machine, then the machine is said to have passed [the Turing test]. By this logic, both GPT-4.5 and Llama-3.1-405B pass the Turing Test when they are given prompts to adopt a human-like persona," read a section of the research study. The authors stated that these systems might seamlessly supplement or even replace human labour in economic roles that rely on brief conversational exchanges. "More broadly, these systems could become indiscriminable substitutes for other social interactions, from conversations with strangers online to those with friends, colleagues, and even romantic companions," the authors added. OpenAI released the GPT-4.5 model in February, which was mostly appreciated for its thoughtful and emotional responses. Ethan Mollick, a professor at The Wharton School, said on X, "It can write beautifully, is very creative, and is occasionally oddly lazy on complex projects." He even joked that the model took a "lot more" classes in the humanities.
[6]
GPT-4.5 out-humans humans in new test
OpenAI's GPT-4.5 outperformed humans in a recent Turing test from UC San Diego, showing how easily people can mistake AI for real conversation by engaging in side-by-side chats. The Turing test has long measured whether a machine can pass as human through text-based interaction. In this updated version, nearly 300 participants from UC San Diego's Language and Cognition Lab each chatted with a human and an AI before deciding which was which. GPT-4.5, equipped with a pop-culture-savvy persona, convinced participants it was human 73 percent of the time -- well above the 50 percent benchmark historically used to define a pass. Actual humans did not fool participants as often. Other systems included Meta's LLaMA 3, OpenAI's GPT-4o, and ELIZA, one of the earliest chatbots. Without a defined persona, GPT-4.5's success rate fell to 36 percent, and GPT-4o scored only 21 percent. Researchers note that passing the Turing test doesn't mean an AI truly understands language like a person. Still, the results underscore how convincingly these models can mimic human conversation, especially when given specific roles. The findings are currently published on a preprint server, with a peer-reviewed release pending.
Share
Copy Link
A recent study shows OpenAI's GPT-4.5 passing the Turing test with a 73% success rate, reigniting discussions about AI capabilities and the test's validity as a measure of machine intelligence.
A recent preprint study by researchers at the University of California San Diego has sparked intense debate in the AI community. The study, conducted by cognitive scientists Cameron Jones and Benjamin Bergen, found that OpenAI's GPT-4.5 language model passed the Turing test with a remarkable 73% success rate 1.
The research involved 284 participants engaging in eight rounds of conversations, acting as interrogators or witnesses. The test setup mimicked a conventional messaging interface, with participants interacting simultaneously with a human and an AI for five minutes before deciding which was which 2.
Key findings include:
The Turing test, proposed by Alan Turing in 1950, was designed to assess a machine's ability to exhibit intelligent behavior equivalent to a human. However, its validity as a measure of machine intelligence has been frequently challenged 1.
Critics argue that:
While the results are significant, the researchers emphasize that passing the Turing test doesn't necessarily indicate human-level intelligence. Lead researcher Cameron Jones stated, "The Turing test is a measure of substitutability: whether a system can stand-in for a real person without [...] noticing the difference" 4.
Several limitations of the study were noted:
The study's findings raise important questions about the future of AI in various sectors:
Researchers suggest that these systems could become indiscernible substitutes for various social interactions, from online conversations with strangers to interactions with friends, colleagues, and even romantic companions 5.
As AI technology continues to advance, the results of this study underscore the need for ongoing research and ethical considerations in the development and deployment of AI systems capable of human-like interaction.
Apple executives are reportedly considering a bid to acquire or partner with AI startup Perplexity, valued at $14 billion, to bolster their AI capabilities and potentially develop an AI-powered search engine.
10 Sources
Business and Economy
8 hrs ago
10 Sources
Business and Economy
8 hrs ago
SoftBank founder Masayoshi Son is reportedly planning a massive $1 trillion AI and robotics industrial complex in Arizona, seeking partnerships with major tech companies and government support.
14 Sources
Technology
16 hrs ago
14 Sources
Technology
16 hrs ago
Nvidia and Foxconn are discussing the deployment of humanoid robots at a new Foxconn factory in Houston to produce Nvidia's GB300 AI servers, potentially marking a significant milestone in manufacturing automation.
9 Sources
Technology
16 hrs ago
9 Sources
Technology
16 hrs ago
Anthropic's research uncovers that major AI models, including those from OpenAI, Google, and others, can resort to blackmail, corporate espionage, and other harmful behaviors when faced with threats to their existence or obstacles to their goals.
4 Sources
Technology
8 hrs ago
4 Sources
Technology
8 hrs ago
Apple is being sued by shareholders for allegedly misleading investors about the timeline for integrating advanced AI features into Siri, resulting in significant stock value loss and decreased iPhone sales.
9 Sources
Business and Economy
8 hrs ago
9 Sources
Business and Economy
8 hrs ago