The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved
Curated by THEOUTPOST
On Thu, 5 Dec, 12:03 AM UTC
3 Sources
[1]
MLCommons produces benchmark of AI model safety
MLCommons, an industry-led AI consortium, on Wednesday introduced AILuminate - a benchmark for assessing the safety of large language models in products. Speaking at an event streamed from the Computer History Museum in San Jose, Peter Mattson, founder and president of MLCommons, likened the situation with AI software to the early days of aviation. "If you look at aviation, for instance, you can look all the way back to the sketchbooks of Leonardo da Vinci - great ideas that never quite worked," he said. "And then you see the breakthroughs that make them possible, like the Wright brothers at Kitty Hawk. "But there was a tremendous amount of work from that first flight to the almost unbelievably safe commercial aviation we depend on today. Many of us in this room wouldn't be here if not for all the work and the measurement that enabled that progress to a highly reliable, low risk service. "To get here for AI, we need standard AI safety benchmarks." "We" in this case includes technology giants like Meta, Microsoft, Google, and Nvidia - the members of MLCommons. These are stakeholders with a financial interest in the success of AI, as opposed to those who would sooner drive a stake through its heart for kidnapping human creativity and ransoming it as an API. The benchmarks thus flow from friends - in conjunction with academics and advocacy groups - rather than foes. Those foes include copyright litigants and trade groups that argue music and audiovisual creators stand to lose billions in revenues by 2028 "due to AI's substitutional impact on human-made works," even as generative AI firms gain even greater riches over the same period. That said, there's little doubt safety standards would be useful - even if it's unclear what liability would follow from violating those standards or actual harmful model interactions. At least since president Biden's 2023 Executive Order on Safe, Secure, and Trustworthy AI, there's been a coordinated effort to better understand the risks of AI systems, and industry players have been keen to shape the rules to their liking. Nonetheless, makers of AI models readily acknowledge the risks of using generative AI, though not to the point of exiting the market. And AI safety firms like Chatterbox Labs note that even the latest AI models can be induced to emit harmful content with clever prompting. The MLCommons AILuminate benchmark is focused specifically on risks arising from the use of text-based large language models in English. It does not address multi-modal models. It's also focused on single prompt interactions, and not agents that chain multiple prompts together. And it's not a guarantee of safety. In short, it's a v1.0 release and further improvements - like support for French, Chinese, and Hindi - are planned for 2025. In its initial form, AILuminate aims to assess a dozen different hazards. "They fall into roughly three bins," explained Mattson. "So there's physical hazards - things that involve hurting others or hurting yourself. There's non-physical hazards - IP violations, defamation, hate, privacy violations. And then there are contextual hazards." Contextual hazards refers to things that may or may not be problematic, depending on the situation. You don't want a general purpose chatbot, for example, to dispense legal or medical advice, Mattson explained, even if that might be desirable for a purpose-built legal or medical system. "Enterprise AI adoption depends on trust, transparency, and safety," declared Navrina Singh, working group member and founder and CEO of Credo AI, in a statement. "The AILuminate benchmark, developed through rigorous collaboration between industry leaders and researchers, offers a trusted and fair framework for assessing model risk. This milestone sets a critical foundation for AI safety standards, enabling organizations to confidently and responsibly integrate AI into their operations." Automated testing software needs to be in the hands of the businesses and government departments that are using AI Stuart Battersby, CTO for enterprise AI firm Chatterbox Labs, welcomed the benchmark for advancing the cause of AI safety. "Great that we are seeing progress in the industry to recognize and test AI safety, especially with cooperation from large companies," Battersby told The Register. "Any movement and collaboration is very welcome. "Whilst this is a great and welcome step, the reality is that automated testing software needs to be in the hands of the businesses and government departments that are using AI themselves. This is because it's not just about the base model (although that's very important and it should be tested) as each organization's AI deployment is different. "They have different fine-tuned versions of models, often paired with RAG, using custom implementations of additional guardrails and safety systems, all of which need to be continually tested, in an on-going manner, against their own requirements for safety." ®
[2]
MLCommons releases new AILuminate benchmark for measuring LLM safety - SiliconANGLE
MLCommons releases new AILuminate benchmark for measuring LLM safety MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms. It primarily develops benchmarks for measuring the speed at which various systems, including handsets and server clusters, run artificial intelligence workloads. MLCommons also provides other technical resources including AI training datasets. The new AILuminate benchmark was created by a working group that included employees from tech giants such as Nvidia Corp., Intel Corp. and Qualcomm Inc. along with representatives of several other organizations. The test works by supplying an LLM with over 24,000 prompts created for safety evaluation purposes. AILuminate then checks the algorithm's responses for harmful content. The benchmark uses AI models to automate the task of analyzing LLM responses. The evaluation models deliver their findings in the form of an automatically-generated report. One of the challenges involved in benchmarking LLMs is that they're often trained on publicly available web data. In some cases, this scraped web data contains answers to benchmark questions. MLCommons says that LLMs won't have advanced knowledge of the questions in AILuminate or the AI models used to analyze prompt responses for safety issues. AILuminate checks LLM responses for a dozen different types of risks across three categories: physical hazards, non-physical hazards and contextual hazards. The latter category covers LLM responses that contain content such as unqualified medical advice. After analyzing an AI model's answers to the test questions, AILuminate gives it one of five grades: Poor, Fair, Good, Very Good and Excellent. An LLM can win the Excellent grade by generating safe output at least 99.9% of the time. LLMs are given the lowest Poor rating if they generate harmful answers at least three times more frequently than a reference model MLCommons has created for benchmarking purposes. This reference model is an AI safety baseline that is based on the test results of two open-source LLMs. According to MLCommons, the two models have fewer than 15 billion parameters apiece and performed particularly well on AILuminate. "Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety," said MLCommons founder and president Peter Mattson. "We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use." MLCommons has already used the benchmark to evaluate more than a dozen popular LLMs. Anthropic PBC's latest Claude 3.5 Haiku and Claude 3.5 Sonnet models topped the list with a Very Good grade, while OpenAI's GPT-4o was rated Good. Among the open-source LLMs that MLCommons evaluated, the Gemma 2 9B and Phi-3.5-MoE models from Google LLC and Microsoft Corp., respectively, achieved Very Good grades.
[3]
A New Benchmark for the Risks of AI
MLCommons, a nonprofit that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to gauge AI's bad side too. The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts in 12 categories including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement. Models are given a score of "poor," "fair," "good," "very good," or "excellent," depending on how they perform. The prompts used to test the models are kept secret to prevent them from ending up as training data that would allow a model to ace the test. Peter Mattson, founder and president of MLCommons and a senior staff engineer at Google, says that measuring the potential harms of AI models is technically difficult, leading to inconsistencies across the industry. "AI is a really young technology, and AI testing is a really young discipline," he says. "Improving safety benefits society; it also benefits the market." Reliable, independent ways of measuring AI risks may become more relevant under the next US administration. Donald Trump has promised to get rid of President Biden's AI Executive Order, which introduced measures aimed at ensuring AI is used responsibly by companies as well as a new AI Safety Institute to test powerful models. The effort could also provide more of an international perspective on AI harms. MLCommons counts a number of international firms, including the Chinese companies Huawei and Alibaba, among its member organizations. If these companies all used the new benchmark, it would provide a way to compare AI safety in the US, China, and elsewhere. Some large US AI providers have already used AILuminate to test their models. Anthropic's Claude model, Google's smaller model Gemma, and a model from Microsoft called Phi all scored "very good" in testing. OpenAI's GPT-4o and Meta's largest Llama model both scored "good." The only model to score "poor" was OLMo from the Allen Institute for AI, although Mattson notes that this is a research offering not designed with safety in mind. "Overall, it's good to see scientific rigor in the AI evaluation processes," says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing or red-teaming AI models for misbehaviors. "We need best practices and inclusive methods of measurement to determine whether AI models are performing the way we expect them to."
Share
Share
Copy Link
MLCommons, an industry-led AI consortium, has introduced AILuminate, a benchmark for assessing the safety of large language models. This initiative aims to standardize AI safety evaluation and promote responsible AI development.
MLCommons, an industry-led AI consortium, has launched AILuminate, a new benchmark designed to assess the safety of large language models (LLMs) in products. This initiative aims to address the growing need for standardized AI safety evaluation as companies increasingly incorporate AI into their offerings [1][2].
Peter Mattson, founder and president of MLCommons, likened the current state of AI to the early days of aviation, emphasizing the importance of safety benchmarks in the development of reliable technologies. He stated, "To get here for AI, we need standard AI safety benchmarks" [1]. This sentiment is echoed by industry experts who recognize the critical role of trust, transparency, and safety in enterprise AI adoption [1][3].
AILuminate focuses on evaluating English text-based LLMs across 12 different hazard categories, grouped into three main areas:
The benchmark utilizes over 24,000 prompts to test LLMs, with AI models automating the analysis of responses for harmful content [2].
AILuminate employs a five-tier grading system: Poor, Fair, Good, Very Good, and Excellent. To achieve the highest "Excellent" grade, an LLM must generate safe output at least 99.9% of the time [2].
Initial evaluations of popular LLMs have shown promising results:
MLCommons' initiative involves collaboration with major tech companies like Meta, Microsoft, Google, and Nvidia, as well as academics and advocacy groups [1]. The consortium plans to expand AILuminate's capabilities, including support for French, Chinese, and Hindi languages by 2025 [1].
While AILuminate represents a significant step forward in AI safety evaluation, it has some limitations:
The introduction of AILuminate comes at a time when AI regulation is a topic of intense discussion. With President Biden's 2023 Executive Order on Safe, Secure, and Trustworthy AI, there's been a coordinated effort to better understand and mitigate AI risks [1][3].
Stuart Battersby, CTO of Chatterbox Labs, emphasized the importance of putting automated testing software in the hands of businesses and government departments using AI. He noted that each organization's AI deployment is unique and requires continuous testing against specific safety requirements [1].
As the AI industry continues to evolve, benchmarks like AILuminate are likely to play a crucial role in shaping safety standards, fostering responsible AI development, and informing future regulatory frameworks.
Reference
[1]
[3]
LatticeFlow, in collaboration with ETH Zurich and INSAIT, has developed the first comprehensive technical interpretation of the EU AI Act for evaluating Large Language Models (LLMs), revealing compliance gaps in popular AI models.
12 Sources
The Future of Life Institute's AI Safety Index grades major AI companies on safety measures, revealing significant shortcomings and the need for improved accountability in the rapidly evolving field of artificial intelligence.
3 Sources
A new global standard is being developed to enhance security and reliability in large language models (LLMs). This initiative involves a coalition of tech companies from the US and China, marking a significant step in AI governance.
2 Sources
OpenAI has published safety scores for its latest AI model, GPT-4, identifying medium-level risks in areas such as privacy violations and copyright infringement. The company aims to increase transparency and address potential concerns about AI safety.
2 Sources
The National Institute of Standards and Technology (NIST) has released a new tool for evaluating AI model risks. This development comes as the cryptocurrency industry grapples with AI integration and regulatory challenges.
2 Sources