Curated by THEOUTPOST
On Tue, 17 Sept, 12:05 AM UTC
9 Sources
[1]
Will 'Humanity's Last Exam' be able to stump expert-level AI?
[2]
AI Experts Ready 'Humanity's Last Exam' to Stump Powerful Tech
(Reuters) - A team of technology experts issued a global call on Monday seeking the toughest questions to pose to artificial intelligence systems, which increasingly have handled popular benchmark tests like child's play.

Dubbed "Humanity's Last Exam," the project seeks to determine when expert-level AI has arrived. It aims to stay relevant even as capabilities advance in future years, according to the organizers, a non-profit called the Center for AI Safety (CAIS) and the startup Scale AI.

The call comes days after the maker of ChatGPT previewed a new model, known as OpenAI o1, which "destroyed the most popular reasoning benchmarks," said Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk's xAI startup.

Hendrycks co-authored two 2021 papers that proposed tests of AI systems that are now widely used: one quizzing them on undergraduate-level knowledge of topics like U.S. history, the other probing models' ability to reason through competition-level math. The undergraduate-style test has more downloads from the online AI hub Hugging Face than any such dataset.

At the time of those papers, AI was giving almost random answers to questions on the exams. "They're now crushed," Hendrycks told Reuters.

As one example, the Claude models from the AI lab Anthropic have gone from scoring about 77% on the undergraduate-level test in 2023 to nearly 89% a year later, according to a prominent capabilities leaderboard. These common benchmarks have less meaning as a result.

AI has appeared to score poorly on lesser-used tests involving plan formulation and visual pattern-recognition puzzles, according to Stanford University's AI Index Report from April. OpenAI o1 scored around 21% on one version of the pattern-recognition ARC-AGI test, for instance, the ARC organizers said on Friday.

Some AI researchers argue that results like this show planning and abstract reasoning to be better measures of intelligence, though Hendrycks said the visual aspect of ARC makes it less suited to assessing language models. "Humanity's Last Exam" will require abstract reasoning, he said.

Answers from common benchmarks may also have ended up in data used to train AI systems, industry observers have said. Hendrycks said some questions on "Humanity's Last Exam" will remain private to make sure AI systems' answers are not from memorization.

The exam will include at least 1,000 crowd-sourced questions, due November 1, that are hard for non-experts to answer. These will undergo peer review, with winning submissions offered co-authorship and prizes of up to $5,000 sponsored by Scale AI.

"We desperately need harder tests for expert-level models to measure the rapid progress of AI," said Alexandr Wang, Scale's CEO.

One restriction: the organizers want no questions about weapons, which some say would be too dangerous for AI to study.

(Reporting by Jeffrey Dastin in San Francisco and Katie Paul in New York; Editing by Christina Fincher)
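For context, the undergraduate-level test described above was published as the MMLU benchmark and is hosted on Hugging Face. The sketch below shows one way to load it and score a model; the dataset id ("cais/mmlu"), its "all" config, and the column names are assumptions based on the public release, and `predict` is a hypothetical stand-in for a real model call, not any specific API.

```python
# Minimal sketch: score a model on the undergraduate-level benchmark (MMLU).
# Assumed identifiers: the Hugging Face dataset "cais/mmlu", its "all" config,
# and columns "question", "choices" (four options), "answer" (correct index).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")

def accuracy(predict):
    """predict(question: str, choices: list[str]) -> int (chosen index)."""
    correct = sum(
        predict(row["question"], row["choices"]) == row["answer"]
        for row in mmlu
    )
    return correct / len(mmlu)

# A trivial always-pick-the-first-choice baseline lands near chance (~25%);
# the leaderboard figures quoted above put recent frontier models near 89%.
print(f"baseline accuracy: {accuracy(lambda q, c: 0):.1%}")
```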
[3]
AI experts ready 'Humanity's Last Exam' to stump powerful tech
[4]
AI experts ready 'Humanity's Last Exam' to stump powerful tech
[5]
AI experts ready 'Humanity's Last Exam' to stump powerful tech
[6]
AI experts ready 'Humanity's Last Exam' to stump powerful tech
[7]
Experts launch global call for tough AI questions in 'Humanity's Last Exam' | Mint
[8]
How do you test AI that is getting smarter than us? A new group is creating 'humanity's toughest exam' to put it to the test
As AI gets smarter and smarter (including breaking rules to prove how capable it is), it's getting trickier to stump. Tests that push GPT-4o to its limits are proving easy for o1-preview -- and it is only going to improve. There's an understandable train of thought that AI could get too clever for humanity's own good, and while we're perhaps some way off Skynet-level catastrophe, the thought has clearly crossed the minds of some technology experts.

A non-profit called the Center for AI Safety (CAIS) has sent out a call for some of the trickiest questions for AI to answer. The idea is that these difficult questions will form "Humanity's Last Exam," a more difficult bar for AI to reach.

Every major AI lab and big tech company with an AI research division also has an AI safety board or equivalent, and many have signed up for external oversight of new frontier models before release. Finding questions and challenges that properly test those models is an important part of that safety picture.

The submission form says, "Together, we are collecting the hardest and broadest set of questions ever," and asks users to "think of something you know that would stump current artificial intelligence (AI) systems." These questions could then be used to better evaluate the capabilities of AI systems in the years to come.

As per Reuters, existing models are already struggling with many of the questions collected, and their answers are scattershot at best. For example, the question "How many positive integer Coxeter-Conway friezes of type G2 are there?" has drawn answers of 14, 1, and 3 from three different AI models.

OpenAI's o1 family of models, currently available in preview and mini versions, has demonstrated an IQ of around 120 and solves PhD-level problems relatively easily. Other models are going to catch up; this is the 'lightest' o1 model, with better to come next year, so finding challenging problems is a high priority for the AI safety community.

According to Dan Hendrycks, director of the Center for AI Safety, the questions will be used to create a new AI benchmark to test new models, and the authors of those questions will be co-authors of the benchmark. The deadline is November 1, and the best questions get part of a $500,000 prize fund.
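The frieze example suggests a simple screening heuristic: a question is a promising exam candidate when current models cannot even agree on an answer. Below is a minimal sketch of such a disagreement filter; the model names and answers are illustrative placeholders rather than real API output.

```python
# Sketch of a disagreement filter for candidate exam questions, in the spirit
# of the Coxeter-Conway frieze example above (answers of 14, 1, and 3).
from collections import Counter

def is_promising(answers: dict[str, str], quorum: int = 2) -> bool:
    """Keep a question if fewer than `quorum` models give the same answer."""
    top_count = Counter(answers.values()).most_common(1)[0][1]
    return top_count < quorum

# Hypothetical answers from three models to the frieze question:
frieze_answers = {"model_a": "14", "model_b": "1", "model_c": "3"}
print(is_promising(frieze_answers))  # True: no two models agree
```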
[9]
AI experts ready 'Humanity's Last Exam' to stump powerful tech
A group of AI researchers is developing a comprehensive test called "Humanity's Last Exam" to assess the capabilities and limitations of advanced AI systems. This initiative aims to identify potential risks and ensure responsible AI development.
A team of artificial intelligence experts is preparing what they call "Humanity's Last Exam," a comprehensive test designed to challenge the most advanced AI systems [1]. The initiative, organized by the non-profit Center for AI Safety (CAIS) and the startup Scale AI, aims to assess the capabilities and limitations of AI technology that has evolved rapidly in recent years.
A primary goal of the exam is to determine when expert-level AI has arrived and to identify the risks that come with increasingly powerful systems. By testing these systems across a wide range of disciplines and scenarios, researchers hope to learn where AI might surpass human abilities and where it still falls short [2].
The exam is expected to cover a diverse array of subjects, including mathematics, science, literature, and creative problem-solving. It will feature questions that require not only factual knowledge but also complex reasoning, ethical decision-making, and the ability to understand context and nuance [3].
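To make the submission pipeline concrete, the sketch below models one crowd-sourced question as a small data structure. Every field name is hypothetical, inferred from the process the organizers describe (expert-written questions, peer review, and a private subset withheld to prevent memorization); no official schema has been published.

```python
# Hypothetical record shape for one crowd-sourced exam submission.
# Field names are invented for illustration; the organizers have not
# published a schema. The `private` flag mirrors the stated plan to
# withhold some questions so answers cannot come from memorized data.
from dataclasses import dataclass, field

@dataclass
class Submission:
    question: str                     # hard for non-experts to answer
    reference_answer: str             # supplied by the submitting expert
    domain: str                       # e.g. "mathematics" or "literature"
    private: bool = False             # withheld from any public release
    reviewer_scores: list[int] = field(default_factory=list)  # peer review

    def accepted(self, threshold: float = 4.0) -> bool:
        """Accept once at least two peer reviewers rate it highly enough."""
        scores = self.reviewer_scores
        return len(scores) >= 2 and sum(scores) / len(scores) >= threshold
```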
This project involves collaboration among AI researchers, ethicists, and experts from various fields. The team is working to ensure that the exam is comprehensive, fair, and truly representative of human intelligence and capabilities [4].
The results of this exam could have significant implications for the future development and regulation of AI technologies. If AI systems perform exceptionally well, it may accelerate discussions about the potential risks and benefits of advanced AI. Conversely, if the exam reveals significant limitations, it could guide future research and development efforts [5].
Some experts have raised concerns about the feasibility and relevance of such an exam. Critics argue that human intelligence is multifaceted and context-dependent, making it challenging to create a truly comprehensive test. Additionally, there are debates about whether surpassing human performance on a test truly indicates superior intelligence or problem-solving abilities [1].
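The ARC-AGI test cited in the sources is a concrete instance of this debate: each task supplies a few input/output grid pairs and asks the solver to infer the underlying transformation, rewarding abstraction over memorized knowledge. The sketch below paraphrases the public ARC task format; the particular grids and the "mirror" rule are invented for illustration.

```python
# Shape of an ARC-AGI-style task (each grid is a small matrix of integers
# 0-9 standing for colors). A solver must infer the rule from the train
# pairs and apply it to the test input; here the invented rule is a
# horizontal mirror of each row.
example_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 4], [5, 0]]}],
}

def mirror(grid):
    """Reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]

# The inferred rule must reproduce every train pair before being applied.
assert all(mirror(p["input"]) == p["output"] for p in example_task["train"])
print(mirror(example_task["test"][0]["input"]))  # [[4, 0], [0, 5]]
```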
Question submissions are due November 1, but the timeline for completing and administering the exam itself has not been disclosed; researchers emphasize the urgency of the project given the rapid pace of AI advancement. The AI community and the public alike are anticipating the results, which could shape the trajectory of AI research and policy in the coming years [2].
The development of "Humanity's Last Exam" raises important questions about the role of AI in society, the nature of intelligence, and the future relationship between humans and machines. As AI continues to advance, this initiative represents a crucial step in understanding and preparing for a world where artificial intelligence may rival or surpass human capabilities in various domains [5].