2 Sources
[1]
Evaluating AI language models just got more effective and efficient
Assessing the progress of new AI language models can be as challenging as training them. Stanford researchers offer a new approach.

As new versions of artificial intelligence language models roll out with increasing frequency, many do so with claims of improved performance. Demonstrating that a new model is actually better than the last, however, remains an elusive and expensive challenge for the field.

Typically, to prove their mettle and improve trust that new models are indeed better, developers subject new models to a battery of benchmark questions. Potentially hundreds of thousands of such benchmark questions are stored in question banks, and the answers must be reviewed by humans, adding time and cost to the process. Practical constraints make it impossible to ask every model every benchmark question, so developers choose a subset, introducing the risk of overestimating improvements based on softer questions.

Stanford researchers have now introduced a cost-effective way to do these evaluations in a new paper published at the International Conference on Machine Learning.

"The key observation we make is that you must also account for how hard the questions are," said Sanmi Koyejo, an assistant professor of computer science in the School of Engineering who led the research. "Some models may do better or worse just by luck of the draw. We're trying to anticipate that and adjust for it to make fairer comparisons."

"This evaluation process can often cost as much or more than the training itself," added co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL). "We've built an infrastructure that allows us to adaptively select subsets of questions based on difficulty. It levels the playing field."

To achieve their goal, Koyejo, Truong, and colleagues borrowed a decades-old concept from education, known as Item Response Theory, which takes into account question difficulty when scoring test-takers. Koyejo compares it to the way standardized tests like the SAT and other kinds of adaptive testing work: every right or wrong answer changes the question that follows.

The researchers use language models to analyze questions and score them on difficulty, reducing evaluation costs by half and in some cases by more than 80%. That difficulty score allows the researchers to compare the relative performance of two models.

To construct a large, diverse, and well-calibrated question bank in a cost-effective way, the researchers use AI's generative powers to create a question generator that can be fine-tuned to any desired level of difficulty. This helps automate the replenishing of question banks and the culling of "contaminated" questions from the database.

With better-designed questions, the authors say, others in the field can make better performance evaluations with a far smaller subset of queries. The approach is faster, fairer, and less expensive. It also works across knowledge domains, from medicine and mathematics to law. Koyejo has tested the system against 22 datasets and 172 language models and found that it adapts easily to both new models and new questions. The approach was able to chart subtle shifts in GPT-3.5's safety over time, which at first improved and then regressed across several variants tested in 2023. Language model safety is a measure of how robust a model is to data manipulation, adversarial attacks, exploitation, and other risks.
Where once reliably evaluating language models was an expensive and inconsistent prospect, the new Item Response Theory approach puts rigorous, scalable, and adaptive evaluation within reach. For developers, this means better diagnostics and more accurate performance evaluations. For users, it means fairer and more transparent model assessments. "And, for everyone else," Koyejo said, "it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence."
[2]
New method makes AI language model evaluations faster, fairer, and less costly
Assessing the progress of new AI language models can be as challenging as training them. Stanford researchers offer a new approach.

As new versions of artificial intelligence language models roll out with increasing frequency, many do so with claims of improved performance. Demonstrating that a new model is actually better than the last, however, remains an elusive and expensive challenge for the field.

Typically, to prove their mettle and improve trust that new models are indeed better, developers subject new models to a battery of benchmark questions. Potentially hundreds of thousands of such benchmark questions are stored in question banks, and the answers must be reviewed by humans, adding time and cost to the process. Practical constraints make it impossible to ask every model every benchmark question, so developers choose a subset, introducing the risk of overestimating improvements based on softer questions.

Stanford researchers have now introduced a cost-effective way to do these evaluations in a new paper presented at the International Conference on Machine Learning (ICML 2025). The study is available on the arXiv preprint server.

"The key observation we make is that you must also account for how hard the questions are," said Sanmi Koyejo, an assistant professor of computer science in the School of Engineering who led the research. "Some models may do better or worse just by luck of the draw. We're trying to anticipate that and adjust for it to make fairer comparisons."

"This evaluation process can often cost as much or more than the training itself," added co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL). "We've built an infrastructure that allows us to adaptively select subsets of questions based on difficulty. It levels the playing field."

Apples and oranges

To achieve their goal, Koyejo, Truong, and colleagues borrowed a decades-old concept from education, known as Item Response Theory, which takes into account question difficulty when scoring test-takers. Koyejo compares it to the way standardized tests like the SAT and other kinds of adaptive testing work: every right or wrong answer changes the question that follows.

The researchers use language models to analyze questions and score them on difficulty, reducing evaluation costs by half and in some cases by more than 80%. That difficulty score allows the researchers to compare the relative performance of two models.

To construct a large, diverse, and well-calibrated question bank in a cost-effective way, the researchers use AI's generative powers to create a question generator that can be fine-tuned to any desired level of difficulty. This helps automate the replenishing of question banks and the culling of "contaminated" questions from the database.

Fast and fair

With better-designed questions, the authors say, others in the field can make better performance evaluations with a far smaller subset of queries. The approach is faster, fairer, and less expensive. It also works across knowledge domains, from medicine and mathematics to law. Koyejo has tested the system against 22 datasets and 172 language models and found that it adapts easily to both new models and new questions. The approach was able to chart subtle shifts in GPT-3.5's safety over time, which at first improved and then regressed across several variants tested in 2023. Language model safety is a measure of how robust a model is to data manipulation, adversarial attacks, exploitation, and other risks.
Where once reliably evaluating language models was an expensive and inconsistent prospect, the new Item Response Theory approach puts rigorous, scalable, and adaptive evaluation within reach. For developers, this means better diagnostics and more accurate performance evaluations. For users, it means fairer and more transparent model assessments. "And, for everyone else," Koyejo said, "it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence."
Stanford researchers introduce a new, cost-effective method for evaluating AI language models using Item Response Theory, making the process faster, fairer, and less expensive.
Researchers at Stanford University have developed a new method for evaluating artificial intelligence (AI) language models that addresses the cost and efficiency challenges of the rapidly evolving field. The approach, presented at the International Conference on Machine Learning (ICML 2025), promises to make the evaluation process faster, fairer, and significantly less expensive [1][2].
As AI language models continue to advance at an unprecedented pace, developers face the daunting task of proving that new iterations are indeed improvements over their predecessors. Traditionally, this involves subjecting models to extensive batteries of benchmark questions, a process that can be as costly and time-consuming as the model training itself [1].
Sanmi Koyejo, an assistant professor of computer science at Stanford's School of Engineering, explains the core issue: "The key observation we make is that you must also account for how hard the questions are. Some models may do better or worse just by luck of the draw. We're trying to anticipate that and adjust for it to make fairer comparisons" [1][2].
To address these challenges, the Stanford team has adapted Item Response Theory, a concept borrowed from educational testing, to the realm of AI evaluation. This approach takes into account the difficulty of questions when assessing model performance, similar to how adaptive standardized tests like the SAT function [1].
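Neither article spells out which item-response model the team fits, so the following is only a minimal sketch of the general idea, assuming the simplest one-parameter (Rasch) variant: each question carries a difficulty value, each model a latent ability, and the probability of a correct answer depends on the gap between the two. Ability is then estimated from graded responses, so two models can be compared even if they answered different question subsets. The names here (p_correct, estimate_ability) are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(ability: float, difficulty: np.ndarray) -> np.ndarray:
    """Rasch (1PL) item response model: probability that a model with the
    given latent ability answers items of the given difficulties correctly.
    Ability and difficulty live on the same logit scale."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def estimate_ability(responses: np.ndarray, difficulty: np.ndarray) -> float:
    """Maximum-likelihood ability estimate from 0/1 graded responses to
    items whose difficulties are already calibrated."""
    def neg_log_likelihood(theta: float) -> float:
        p = p_correct(theta, difficulty)
        eps = 1e-9  # guard against log(0)
        return -np.sum(responses * np.log(p + eps)
                       + (1 - responses) * np.log(1 - p + eps))
    return minimize_scalar(neg_log_likelihood, bounds=(-6, 6), method="bounded").x

# Toy comparison: two models answer the same five items, ordered easy to hard.
difficulty = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
model_a = np.array([1, 1, 1, 1, 0])  # misses only the hardest item
model_b = np.array([1, 1, 0, 0, 0])  # misses everything past the easy items
print(estimate_ability(model_a, difficulty))  # higher latent ability
print(estimate_ability(model_b, difficulty))  # lower latent ability
```

Comparing latent abilities rather than raw accuracy is what makes scores from differently difficult question subsets commensurable, which is the fairness point Koyejo describes.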
The researchers use AI language models to analyze and score questions based on difficulty, reducing evaluation costs by half, and in some cases by more than 80% [2].
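The articles liken the method to adaptive tests such as the SAT, where every right or wrong answer changes the next question. As a hedged illustration of how difficulty scores could cut the number of questions needed, the sketch below (reusing numpy and the estimate_ability helper from the previous sketch) repeatedly asks the unasked item whose difficulty is closest to the current ability estimate, which is where a Rasch-style item is most informative, then re-estimates ability. The grade callback, budget, and stopping rule are assumptions for illustration, not the paper's actual procedure.

```python
def adaptive_evaluation(grade, difficulty: np.ndarray, budget: int = 30) -> float:
    """Adaptive-testing sketch: ask only the most informative items.

    grade(i) -> 1 if the model under test answers item i correctly, else 0
    (in practice an LLM call followed by automatic or human grading).
    Reuses np and estimate_ability from the previous sketch.
    """
    asked: list[int] = []
    responses: list[int] = []
    ability = 0.0  # start from a neutral prior guess
    for _ in range(min(budget, len(difficulty))):
        remaining = [i for i in range(len(difficulty)) if i not in asked]
        # Under the Rasch model, an item is most informative when its
        # difficulty matches the current ability estimate, so pick the closest.
        nxt = min(remaining, key=lambda i: abs(difficulty[i] - ability))
        asked.append(nxt)
        responses.append(grade(nxt))
        ability = estimate_ability(np.array(responses), difficulty[asked])
    return ability
```

Stopping after a few dozen well-chosen items instead of sweeping an entire question bank is, plausibly, where savings of the magnitude the articles report would come from.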
A key component of the new system is its ability to generate and calibrate questions automatically. The researchers have developed an AI-powered question generator that can be fine-tuned to produce questions of varying difficulty levels. This not only helps in replenishing question banks but also in removing potentially contaminated or outdated questions [1][2].
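The articles do not describe the generator pipeline itself. One plausible shape, sketched under heavy assumptions, is: produce candidate questions (generate_candidates below is a hypothetical callable standing in for the fine-tuned generator), probe each candidate against a panel of reference models, convert the panel's pass rate into a logit-scale difficulty proxy, keep candidates in the desired difficulty band, and discard items that every reference model answers correctly, a crude stand-in for the contamination culling the article mentions.

```python
def replenish_bank(generate_candidates, probe_panel,
                   target_range=(0.5, 2.0), n_candidates: int = 200):
    """Hypothetical question-bank refresh loop; all names are illustrative.

    generate_candidates(n) -> list of n candidate question strings.
    probe_panel(question)  -> list of 0/1 outcomes, one per reference model.
    Reuses np from the earlier sketch.
    """
    kept = []
    for question in generate_candidates(n_candidates):
        outcomes = probe_panel(question)
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate == 1.0:
            # Assumed heuristic: a question every reference model answers
            # may be trivially easy or leaked into training data.
            continue
        p = min(max(pass_rate, 1e-3), 1 - 1e-3)
        difficulty = float(np.log((1 - p) / p))  # logit-scale difficulty proxy
        if target_range[0] <= difficulty <= target_range[1]:
            kept.append((question, difficulty))
    return kept
```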
The new evaluation approach has demonstrated versatility across different knowledge domains, including medicine, mathematics, and law. Koyejo and his team have tested the system against 22 datasets and 172 language models, finding that it adapts readily to both new models and new questions [1][2].
This innovative evaluation method has far-reaching implications for the AI industry. For developers, it offers more accurate performance evaluations and better diagnostic tools. Users can expect fairer and more transparent model assessments [1].
Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL) and co-author of the study, emphasizes the significance of this development: "This evaluation process can often cost as much or more than the training itself. We've built an infrastructure that allows us to adaptively select subsets of questions based on difficulty. It levels the playing field" [1][2].
The Stanford team's approach has already shown practical utility in tracking the safety of language models over time. They were able to chart subtle shifts in GPT-3.5's safety throughout 2023, revealing initial improvements followed by some regression across the variants tested [1][2].
As the field of AI continues to advance at a rapid pace, this new evaluation method stands to play a crucial role in ensuring more rigorous, scalable, and adaptive assessments of language models. Koyejo concludes, "And, for everyone else, it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence" [1][2].