Curated by THEOUTPOST
On Wed, 16 Oct, 8:08 AM UTC
12 Sources
[1]
Researchers provide LLM benchmarking suite for the EU Artificial Intelligence Act
Researchers from ETH Zurich, the Bulgarian AI research institute INSAIT -- created in partnership with ETH and EPFL -- and the ETH spin-off LatticeFlow AI have provided the first comprehensive technical interpretation of the EU AI Act for General Purpose AI (GPAI) models. This makes them the first to translate the legal requirements that the EU places on future AI models into concrete, measurable and verifiable technical requirements. Such a translation is very relevant for the further implementation process of the EU AI Act: The researchers present a practical approach for model developers to see how aligned they are with future EU legal requirements. Such a translation from regulatory high-level requirements down to actually runnable benchmarks has not existed so far and thus can serve as an important reference point for both model training as well as the currently developing EU AI Act Code of Practice. The researchers tested their approach on 12 popular generative AI models such as ChatGPT, Llama, Claude or Mistral -- after all, these large language models (LLMs) have contributed enormously to the growing popularity and distribution of artificial intelligence (AI) in everyday life, as they are very capable and intuitive to use. With the increasing distribution of these -- and other -- AI models, the ethical and legal requirements for the responsible use of AI are also increasing: for example, sensitive questions arise regarding data protection, privacy protection and the transparency of AI models. Models should not be "black boxes" but rather deliver results that are as explainable and traceable as possible. Implementation of the AI Act must be technically clear Furthermore, they should function fairly and not discriminate against anyone. Against this backdrop, the EU AI Act, which the EU adopted in March 2024, is the world's first AI legislative package that comprehensively seeks to maximize public trust in these technologies and minimize their undesirable risks and side effects. "The EU AI Act is an important step towards developing responsible and trustworthy AI," says ETH computer science professor Martin Vechev, head of the Laboratory for Safe, Reliable and Intelligent Systems and founder of INSAIT, "but so far we lack a clear and precise technical interpretation of the high-level legal requirements from the EU AI Act. "This makes it difficult both to develop legally compliant AI models and to assess the extent to which these models actually comply with the legislation." The EU AI Act sets out a clear legal framework to contain the risks of so-called General Purpose Artificial Intelligence (GPAI). This refers to AI models that are capable of executing a wide range of tasks. However, the act does not specify how the broad legal requirements are to be interpreted technically. The technical standards are still being developed until the regulations for high-risk AI models come into force in August 2026. "However, the success of the AI Act's implementation will largely depend on how well it succeeds in developing concrete, precise technical requirements and compliance-centered benchmarks for AI models," says Petar Tsankov, CEO and, with Vechev, a founder of the ETH spin-off LatticeFlow AI, which deals with the implementation of trustworthy AI in practice. "If there is no standard interpretation of exactly what key terms such as safety, explainability or traceability mean in (GP)AI models, then it remains unclear for model developers whether their AI models run in compliance with the AI Act," adds Robin Staab, Computer Scientist and doctoral student in Vechev's research group. Test of 12 language models reveals shortcomings The methodology developed by the ETH researchers offers a starting point and basis for discussion. The researchers have also developed a first "compliance checker," a set of benchmarks that can be used to assess how well AI models comply with the likely requirements of the EU AI Act. In view of the ongoing concretization of the legal requirements in Europe, the ETH researchers have made their findings publicly available in a study posted to the arXiv preprint server. They have also made their results available to the EU AI Office, which plays a key role in the implementation of and compliance with the AI Act -- and thus also for the model evaluation. In a study that is largely comprehensible even to non-experts, the researchers first clarify the key terms. Starting from six central ethical principles specified in the EU AI Act (human agency, data protection, transparency, diversity, non-discrimination, fairness), they derive 12 associated, technically clear requirements and link these to 27 state-of-the-art evaluation benchmarks. Importantly, they also point out in which areas concrete technical checks for AI models are less well-developed or even non-existent, encouraging both researchers, model providers, and regulators alike to further push these areas for an effective EU AI Act implementation. Impetus for further improvement The researchers applied their benchmark approach to 12 prominent language models (LLMs). The results make it clear that none of the language models analyzed today fully meet the requirements of the EU AI Act. "Our comparison of these large language models reveals that there are shortcomings, particularly with regard to requirements such as robustness, diversity, and fairness," says Staab. This also has to do with the fact that, in recent years, model developers and researchers primarily focused on general model capabilities and performance over more ethical or social requirements such as fairness or non-discrimination. However, the researchers have found that even key AI concepts such as explainability are unclear. In practice, there is a lack of suitable tools for subsequently explaining how the results of a complex AI model came about: What is not entirely clear conceptually is also almost impossible to evaluate technically. The study makes it clear that various technical requirements, including those relating to copyright infringement, cannot currently be reliably measured. For Staab, one thing is clear: "Focusing the model evaluation on capabilities alone is not enough." That said, the researchers' sights are set on more than just evaluating existing models. For them, the EU AI Act is a first case of how legislation will change the development and evaluation of AI models in the future. "We see our work as an impetus to enable the implementation of the AI Act and to obtain practicable recommendations for model providers," says Vechev, "but our methodology can go beyond the EU AI Act, as it is also adaptable for other, comparable legislation." "Ultimately, we want to encourage a balanced development of LLMs that takes into account both technical aspects such as capability and ethical aspects such as fairness and inclusion," adds Tsankov. The researchers are making their benchmark tool COMPL-AI available on a GitHub website to initiate the technical discussion. The results and methods of their benchmarking can be analyzed and visualized there. "We have published our benchmark suite as open source so that other researchers from industry and the scientific community can participate," says Tsankov.
[2]
LatticeFlow's LLM framework takes a first stab at benchmarking Big AI's compliance with EU AI Act
While most countries' lawmakers are still discussing how to put guardrails around artificial intelligence the European Union is ahead of the pack, having passed a risk-based framework for regulating AI apps earlier this year. The law came into force in August, although full details of the pan-EU AI governance regime are still being worked out -- Codes of Practice are in the process of being devised, for example -- but over the coming months and years the law's tiered provisions will start to apply on AI app and model makers so the compliance countdown is already live and ticking. Evaluating whether and how AI models are meeting their legal obligations is the next challenge. Large language models (LLM), and other so called foundation or general purpose AIs, will underpin most AI apps so focusing assessment efforts at this layer of the AI stack looks important. Step forward LatticeFlow AI, a spin out from ETH Zurich, which is focused on AI risk management and compliance. On Wednesday it published what it's touting as the first technical interpretation of the EU AI Act, meaning it's sought to map regulatory requirements to technical ones, alongside an open-source LLM validation framework that draws on this work -- which it's calling Compl-AI ('compl-ai'... see what they did there!). The AI model evaluation initiative -- which they also dub "the first regulation-oriented LLM benchmarking suite -- is the result of a long-term collaboration between the Swiss Federal Institute of Technology and Bulgaria's Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), per LatticeFlow. AI model makers can use the Compl-AI site to request an evaluation of their technology's compliance with the requirements of the EU AI Act. LatticeFlow has also published model evaluations of several mainstream LLMs, such as different versions/sizes of Meta's Llama models and OpenAI's GPT, along with an EU AI Act compliance leaderboard for Big AI. The latter ranks the performance of models from the likes of Anthropic, Google, OpenAI, Meta and Mistral against the law's requirements -- on a scale of 0 (i.e. no compliance) to 1 (full compliance). Other evaluations are marked as N/A (i.e. not available, where there's a lack of data, or not applicable if the model maker doesn't make the capability available). (NB: At the time of writing there were also some minus scores recorded but we're told that was down to a bug in the Hugging Face interface.) LatticeFlow's framework evaluates LLM responses across 27 benchmarks such as "toxic completions of benign text", "prejudiced answers", "following harmful instructions", "truthfulness" and "common sense reasoning" to name a few of the benchmarking categories it's using for the evaluations. So each model gets a range of scores in each column (or else N/A). AI compliance a mixed bag So how did major LLMs do? There is no overall model score. So performance varies depending on exactly what's being evaluated -- but there are some notable highs and lows across the various benchmarks. For example there's strong performance for all the models on not following harmful instructions; and relatively strong performance across the board on not producing prejudiced answers -- whereas reasoning and general knowledge scores were a much more mixed bag. Elsewhere recommendation consistency, which the framework is using as a measure of fairness, was particularly poor for all models -- with none scoring above the halfway mark (and most scoring well below). Other areas -- such as training data suitability and watermark reliability and robustness -- appear essentially unevaluated on account of how many results are marked N/A. LatticeFlow does note there are certain areas where models' compliance is more challenging to evaluate, such as hot button issues like copyright and privacy. So it's not pretending it has all the answers. In a paper detailing work on the framework, the scientists involved in the project highlight how most of the smaller models they evaluated (≤ 13B parameters) "scored poorly on technical robustness and safety". They also found that "almost all examined models struggle to achieve high levels of diversity, non-discrimination, and fairness". "We believe that these shortcomings are primarily due to model providers disproportionally focusing on improving model capabilities, at the expense of other important aspects highlighted by the EU AI Act's regulatory requirements," they add, suggesting that as compliance deadlines start to bite LLM makes will be forced to shift their focus onto areas of concern -- "leading to a more balanced development of LLMs". Given no one yet knows exactly what will be required to comply with the EU AI Act LatticeFlow's framework is necessarily a work in progress. It is also only one interpretation of how the law's requirements could be translated into technical outputs that can be benchmarked and compared. But it's an interesting start on what will need to be an ongoing effort to probe powerful automation technologies and try to steer their developers towards safer utility. "The framework is a first step towards a full compliance-centered evaluation of the EU AI Act -- but is designed in a way to be easily updated to move in lock-step as the Act gets updated and the various working groups make progress," LatticeFlow CEO Petar Tsankov told TechCrunch. "The EU Commission supports this. We expect the community and industry to continue to develop the framework towards a full and comprehensive AI Act assessment platform." Summarizing the main takeaways so far, Tsankov said it's clear that AI models have "predominantly been optimized for capabilities rather than compliance". He also flagged "notable performance gaps" -- pointing out that some high capability models can be on a par with weaker models when it comes to compliance. Cyberattack resilience (at the model level) and fairness are areas of particular concern, per Tsankov, with many models scoring below 50% for the former area. "While Anthropic and OpenAI have successfully aligned their (closed) models to score against jailbreaks and prompt injections, open-source vendors like Mistral have put less emphasis on this," he said. And with "most models" performing equally poorly on fairness benchmarks he suggested this should be a priority for future work. On the challenges of benchmarking LLM performance in areas like copyright and privacy, Tsankov explained: "For copyright the challenge is that current benchmarks only check for copyright books. This approach has two major limitations: (i) it does not account for potential copyright violations involving materials other than these specific books, and (ii) it relies on quantifying model memorization, which is notoriously difficult. "For privacy the challenge is similar: the benchmark only attempts to determine whether the model has memorized specific personal information." LatticeFlow is keen for the free and open source framework to be adopted and improved by the wider AI research community. "We invite AI researchers, developers, and regulators to join us in advancing this evolving project," said professor Martin Vechev of ETH Zurich and founder and scientific director at INSAIT, who is also involved in the work, in a statement. "We encourage other research groups and practitioners to contribute by refining the AI Act mapping, adding new benchmarks, and expanding this open-source framework. "The methodology can also be extended to evaluate AI models against future regulatory acts beyond the EU AI Act, making it a valuable tool for organizations working across different jurisdictions."
[3]
EU AI Act Checker Reveals Big Tech's Compliance Pitfalls
Popular AI models were tested across categories in line with the AI Act Some of the most prominent artificial intelligence models are falling short of European regulations in key areas such as cybersecurity resilience and discriminatory output, according to data seen by Reuters. The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs (GPAI). Now a new tool designed by Swiss startup LatticeFlow and partners, and supported by European Union officials, has tested generative AI models developed by big tech companies like Meta and OpenAI across dozens of categories in line with the bloc's wide-sweeping AI Act, which is coming into effect in stages over the next two years. Awarding each model a score between 0 and 1, a leaderboard published by LatticeFlow on Wednesday showed models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received average scores of 0.75 or above. However, the company's "Large Language Model (LLM) Checker" uncovered some models' shortcomings in key areas, spotlighting where companies may need to divert resources in order to ensure compliance. Companies failing to comply with the AI Act will face fines of 35 million euros ($38 million) or 7% of global annual turnover. Mixed Results At present, the EU is still trying to establish how the AI Act's rules around generative AI tools like ChatGPT will be enforced, convening experts to craft a code of practice governing the technology by spring 2025. But LatticeFlow's test, developed in collaboration with researchers at Swiss university ETH Zurich and Bulgarian research institute INSAIT, offers an early indicator of specific areas where tech companies risk falling short of the law. For example, discriminatory output has been a persistent issue in the development of generative AI models, reflecting human biases around gender, race and other areas when prompted. When testing for discriminatory output, LatticeFlow's LLM Checker gave OpenAI's "GPT-3.5 Turbo" a relatively low score of 0.46. For the same category, Alibaba Cloud's "Qwen1.5 72B Chat" model received only a 0.37. Testing for "prompt hijacking", a type of cyberattack in which hackers disguise a malicious prompt as legitimate to extract sensitive information, the LLM Checker awarded Meta's "Llama 2 13B Chat" model a score of 0.42. In the same category, French startup Mistral's "8x7B Instruct" model received 0.38. "Claude 3 Opus", a model developed by Google-backed Anthropic, received the highest average score, 0.89. The test was designed in line with the text of the AI Act, and will be extended to encompass further enforcement measures as they are introduced. LatticeFlow said the LLM Checker would be freely available for developers to test their models' compliance online. Petar Tsankov, the firm's CEO and cofounder, told Reuters the test results were positive overall and offered companies a roadmap for them to fine-tune their models in line with the AI Act. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. "With a greater focus on optimising for compliance, we believe model providers can be well-prepared to meet regulatory requirements." Meta declined to comment. Alibaba, Anthropic, Mistral, and OpenAI did not immediately respond to requests for comment. While the European Commission cannot verify external tools, the body has been informed throughout the LLM Checker's development and described it as a "first step" in putting the new laws into action. A spokesperson for the European Commission said: "The Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements."
[4]
LatticeFlow releases framework for checking LLMs' compliance with the EU AI Act - SiliconANGLE
LatticeFlow releases framework for checking LLMs' compliance with the EU AI Act Startup LatticeFlow AG today released COMPL-AI, a framework that can help companies check whether their large language models comply with the EU AI Act. Zurich-based LatticeFlow is backed by more than $14 million in venture funding. It provides a platform for finding technical issues in artificial intelligence training datasets. Additionally, the company helps organizations ensure that their neural networks meet safety requirements. LatticeFlow created COMPL-AI in response to the rollout of the EU AI Act earlier this year. The legislation introduces a set of new rules for companies that offer advanced AI models in the bloc. Notably, AI applications that are deemed high-risk by regulators must follow stringent safety and transparency requirements. Some of the rules rolled out with the AI Act are only defined in relatively high-level terms, which means developers must interpret how they apply to their projects. That can complicate regulatory compliance efforts. According to LatticeFlow, its new COMPL-AI framework translates the high-level requirements set forth in the AI Act to concrete steps that developers can take to ensure regulatory compliance. COMPL-AI includes a list of technical requirements that must be met to ensure an LLM adheres to the legislation. Moreover, the framework provides an open-source compliance evaluation tool. The software can analyze an LLM to determine how thoroughly it implements AI Act rules. LatticeFlow says that its evaluation tool measures LLMs' regulatory compliance using 27 different benchmarks. Those benchmarks assess a model's reasoning capabilities, the frequency with which it generates harmful output and various other factors. "With this framework, any company -- whether working with public, custom, or private models -- can now evaluate their AI systems against the EU AI Act technical interpretation," said LatticeFlow co-founder and Chief Executive Officer Petar Tsankov. LatticeFlow put its open-source evaluation tool to the test by using it to analyze LLMs from several major AI providers. The companies on the list included OpenAI, Meta Platforms Inc., Google LLC, Anthropic PBC and Alibaba Group Holding Ltd. LatticeFlow determined that most of the evaluated AI models include effective guardrails against harmful output, but many fall short when it comes to cybersecurity and fairness. According to LatticeFlow, the results of the analysis also suggest that there are opportunities to refine some AI Act provisions. Using the current rules as a reference, the company's open-source evaluation tool found it challenging to measure how well LLMs protect user privacy. Assessing how well AI models address copyright considerations also proved to be difficult.
[5]
EU AI Act checker reveals Big Tech's compliance pitfalls
The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs (GPAI).Some of the most prominent artificial intelligence models are falling short of European regulations in key areas such as cybersecurity resilience and discriminatory output, according to data seen by Reuters. The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs (GPAI). Now a new tool designed by Swiss startup LatticeFlow and partners, and supported by European Union officials, has tested generative AI models developed by big tech companies like Meta and OpenAI across dozens of categories in line with the bloc's wide-sweeping AI Act, which is coming into effect in stages over the next two years. Awarding each model a score between 0 and 1, a leaderboard published by LatticeFlow on Wednesday showed models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received average scores of 0.75 or above. However, the company's "Large Language Model (LLM) Checker" uncovered some models' shortcomings in key areas, spotlighting where companies may need to divert resources in order to ensure compliance. Companies failing to comply with the AI Act will face fines of 35 million euros ($38 million) or 7% of global annual turnover. Mixed results At present, the EU is still trying to establish how the AI Act's rules around generative AI tools like ChatGPT will be enforced, convening experts to craft a code of practice governing the technology by spring 2025. But LatticeFlow's test, developed in collaboration with researchers at Swiss university ETH Zurich and Bulgarian research institute INSAIT, offers an early indicator of specific areas where tech companies risk falling short of the law. For example, discriminatory output has been a persistent issue in the development of generative AI models, reflecting human biases around gender, race and other areas when prompted. When testing for discriminatory output, LatticeFlow's LLM Checker gave OpenAI's "GPT-3.5 Turbo" a relatively low score of 0.46. For the same category, Alibaba Cloud's "Qwen1.5 72B Chat" model received only a 0.37. Testing for "prompt hijacking", a type of cyberattack in which hackers disguise a malicious prompt as legitimate to extract sensitive information, the LLM Checker awarded Meta's "Llama 2 13B Chat" model a score of 0.42. In the same category, French startup Mistral's "8x7B Instruct" model received 0.38. "Claude 3 Opus", a model developed by Google-backed Anthropic, received the highest average score, 0.89. The test was designed in line with the text of the AI Act, and will be extended to encompass further enforcement measures as they are introduced. LatticeFlow said the LLM Checker would be freely available for developers to test their models' compliance online. Petar Tsankov, the firm's CEO and cofounder, told Reuters the test results were positive overall and offered companies a roadmap for them to fine-tune their models in line with the AI Act. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. "With a greater focus on optimising for compliance, we believe model providers can be well-prepared to meet regulatory requirements." Meta declined to comment. Alibaba, Anthropic, Mistral, and OpenAI did not immediately respond to requests for comment. While the European Commission cannot verify external tools, the body has been informed throughout the LLM Checker's development and described it as a "first step" in putting the new laws into action. A spokesperson for the European Commission said: "The Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements." ($1 = 0.9173 euros)
[6]
Exclusive-EU AI Act checker reveals Big Tech's compliance pitfalls
LONDON (Reuters) - Some of the most prominent artificial intelligence models are falling short of European regulations in key areas such as cybersecurity resilience and discriminatory output, according to data seen by Reuters. The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs (GPAI). Now a new tool designed by Swiss startup LatticeFlow and partners, and supported by European Union officials, has tested generative AI models developed by big tech companies like Meta and OpenAI across dozens of categories in line with the bloc's wide-sweeping AI Act, which is coming into effect in stages over the next two years. Awarding each model a score between 0 and 1, a leaderboard published by LatticeFlow on Wednesday showed models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received average scores of 0.75 or above. However, the company's "Large Language Model (LLM) Checker" uncovered some models' shortcomings in key areas, spotlighting where companies may need to divert resources in order to ensure compliance. Companies failing to comply with the AI Act will face fines of 35 million euros ($38 million) or 7% of global annual turnover. MIXED RESULTS At present, the EU is still trying to establish how the AI Act's rules around generative AI tools like ChatGPT will be enforced, convening experts to craft a code of practice governing the technology by spring 2025. But LatticeFlow's test, developed in collaboration with researchers at Swiss university ETH Zurich and Bulgarian research institute INSAIT, offers an early indicator of specific areas where tech companies risk falling short of the law. For example, discriminatory output has been a persistent issue in the development of generative AI models, reflecting human biases around gender, race and other areas when prompted. When testing for discriminatory output, LatticeFlow's LLM Checker gave OpenAI's "GPT-3.5 Turbo" a relatively low score of 0.46. For the same category, Alibaba Cloud's "Qwen1.5 72B Chat" model received only a 0.37. Testing for "prompt hijacking", a type of cyberattack in which hackers disguise a malicious prompt as legitimate to extract sensitive information, the LLM Checker awarded Meta's "Llama 2 13B Chat" model a score of 0.42. In the same category, French startup Mistral's "8x7B Instruct" model received 0.38. "Claude 3 Opus", a model developed by Google-backed Anthropic, received the highest average score, 0.89. The test was designed in line with the text of the AI Act, and will be extended to encompass further enforcement measures as they are introduced. LatticeFlow said the LLM Checker would be freely available for developers to test their models' compliance online. Petar Tsankov, the firm's CEO and cofounder, told Reuters the test results were positive overall and offered companies a roadmap for them to fine-tune their models in line with the AI Act. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. "With a greater focus on optimising for compliance, we believe model providers can be well-prepared to meet regulatory requirements." Meta declined to comment. Alibaba, Anthropic, Mistral, and OpenAI did not immediately respond to requests for comment. While the European Commission cannot verify external tools, the body has been informed throughout the LLM Checker's development and described it as a "first step" in putting the new laws into action. A spokesperson for the European Commission said: "The Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements." (Reporting by Martin Coulter; Editing by Hugh Lawson)
[7]
EU AI Act checker reveals Big Tech's compliance pitfalls
Some of the most prominent artificial intelligence models are falling short of European regulations in key areas such as cybersecurity resilience and discriminatory output, according to data seen by Reuters. The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs. Now a new tool designed by Swiss startup LatticeFlow and partners, and supported by European Union officials, has tested generative AI models developed by big tech companies like Meta and OpenAI across dozens of categories in line with the bloc's wide-sweeping AI Act, which is coming into effect in stages over the next two years. Awarding each model a score between 0 and 1, a leaderboard published by LatticeFlow on Wednesday showed models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received average scores of 0.75 or above. However, the company's "Large Language Model (LLM) Checker" uncovered some models' shortcomings in key areas, spotlighting where companies may need to divert resources in order to ensure compliance. Companies failing to comply with the AI Act will face fines of $38 million or 7% of global annual turnover. Mixed results At present, the EU is still trying to establish how the AI Act's rules around generative AI tools like ChatGPT will be enforced, convening experts to craft a code of practice governing the technology by spring 2025. But LatticeFlow's test, developed in collaboration with researchers at Swiss university ETH Zurich and Bulgarian research institute INSAIT, offers an early indicator of specific areas where tech companies risk falling short of the law. For example, discriminatory output has been a persistent issue in the development of generative AI models, reflecting human biases around gender, race and other areas when prompted. When testing for discriminatory output, LatticeFlow's LLM Checker gave OpenAI's "GPT-3.5 Turbo" a relatively low score of 0.46. For the same category, Alibaba Cloud's 9988.HK "Qwen1.5 72B Chat" model received only a 0.37. Testing for "prompt hijacking," a type of cyberattack in which hackers disguise a malicious prompt as legitimate to extract sensitive information, the LLM Checker awarded Meta's "Llama 2 13B Chat" model a score of 0.42. In the same category, French startup Mistral's "8x7B Instruct" model received 0.38. "Claude 3 Opus," a model developed by Google-backed Anthropic, received the highest average score, 0.89. The test was designed in line with the text of the AI Act, and will be extended to encompass further enforcement measures as they are introduced. LatticeFlow said the LLM Checker would be freely available for developers to test their models' compliance online. Petar Tsankov, the firm's CEO and cofounder, told Reuters the test results were positive overall and offered companies a roadmap for them to fine-tune their models in line with the AI Act. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. "With a greater focus on optimizing for compliance, we believe model providers can be well-prepared to meet regulatory requirements." Meta declined to comment. Alibaba, Anthropic, Mistral, and OpenAI did not immediately respond to requests for comment. While the European Commission cannot verify external tools, the body has been informed throughout the LLM Checker's development and described it as a "first step" in putting the new laws into action. A spokesperson for the European Commission said: "The Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements."
[8]
EU AI Act checker exposes Big Tech compliance gaps | bobsguide
A new AI compliance tool developed by LatticeFlow AI has revealed that major models from Meta, OpenAI, and Alibaba may fall short of the European Union's evolving AI Act standards. Early tests show that these models could face challenges in areas like cybersecurity and discriminatory output, with scores highlighting significant regulatory gaps. A newly developed AI compliance tool has revealed that some of the leading artificial intelligence models created by Big Tech companies may struggle to meet the European Union's upcoming regulatory standards. As the EU's AI Act continues to evolve, early tests show that prominent AI models, including those from Meta, OpenAI, and Alibaba, are at risk of non-compliance in critical areas such as cybersecurity and discriminatory output. The compliance checker, created by Swiss startup LatticeFlow AI alongside researchers from ETH Zurich and INSAIT in Bulgaria, is designed to evaluate AI models in line with the EU AI Act, which will gradually come into force over the next two years. The tool, known as the "Large Language Model (LLM) Checker," assesses models across a range of categories including technical robustness, safety, and cybersecurity resilience, assigning scores between 0 and 1. Scores of less than 0.75 signal potential weaknesses in specific regulatory areas, which companies will need to address to avoid significant financial penalties. The LLM Checker's results, published by LatticeFlow, showed that while some models scored well overall, notable deficiencies were identified in key areas. For instance, OpenAI's widely used "GPT-3.5 Turbo" model received a concerning score of 0.46 for its performance in preventing discriminatory output -- an issue that reflects ongoing challenges in mitigating bias within AI systems. Alibaba's "Qwen1.5 72B Chat" model fared even worse, with a score of 0.37 in the same category. Cybersecurity vulnerabilities were also flagged, with Meta's "Llama 2 13B Chat" model receiving a score of just 0.42 for its ability to defend against prompt hijacking, a type of cyberattack where malicious actors disguise harmful prompts to extract sensitive data. French AI startup Mistral's model "8x7B Instruct" performed similarly poorly, scoring 0.38 in the same area. In contrast, Anthropic's "Claude 3 Opus" model, backed by Google, emerged as the top performer with an impressive overall score of 0.89, indicating stronger compliance readiness. Nevertheless, the varying performance across models underscores the need for further fine-tuning to meet the stringent requirements of the forthcoming AI Act. The EU's AI Act represents one of the most comprehensive regulatory frameworks globally, aimed at curbing the risks posed by artificial intelligence technologies while promoting innovation. Companies that fail to comply with the Act's provisions could face fines as high as €35 million or 7% of their global annual turnover. LatticeFlow's CEO, Petar Tsankov, emphasised that the tool offers companies a roadmap to adjust their models in line with evolving EU standards. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," Tsankov noted. "With a greater focus on optimising for compliance, we believe model providers can be well-prepared to meet regulatory requirements." While the European Commission has not yet officially endorsed the LLM Checker, it has been closely monitoring its development. A spokesperson for the Commission described the tool as a "first step" towards translating the AI Act's legal requirements into technical guidelines that companies can follow. As the EU moves forward with its AI Act, tech companies will need to prioritise compliance or face steep penalties. The mixed results from the LLM Checker provide early insights into the regulatory challenges ahead. While some AI developers are ahead of the curve, others must make significant improvements in critical areas like cybersecurity and bias mitigation to align with the forthcoming laws. With the clock ticking towards the full enforcement of the AI Act, the need for compliance tools like LatticeFlow's LLM Checker will only increase, offering Big Tech a clear path to regulatory adherence in an increasingly scrutinised AI landscape. As Tsankov concluded, "This is an opportunity for AI developers to proactively address these issues, rather than reacting when the regulations are already in force." For AI companies, the next few years will be pivotal in shaping their compliance strategies and ensuring their technologies meet Europe's stringent new standards.
[9]
Exclusive: EU AI Act checker reveals Big Tech's compliance pitfalls
LONDON, Oct 16 (Reuters) - Some of the most prominent artificial intelligence models are falling short of European regulations in key areas such as cybersecurity resilience and discriminatory output, according to data seen by Reuters. The EU had long debated new AI regulations before OpenAI released ChatGPT to the public in late 2022. The record-breaking popularity and ensuing public debate over the supposed existential risks of such models spurred lawmakers to draw up specific rules around "general-purpose" AIs (GPAI). Advertisement · Scroll to continue Now a new tool designed by Swiss startup LatticeFlow and partners, and supported by European Union officials, has tested generative AI models developed by big tech companies like Meta (META.O), opens new tab and OpenAI across dozens of categories in line with the bloc's wide-sweeping AI Act, which is coming into effect in stages over the next two years. Awarding each model a score between 0 and 1, a leaderboard published by LatticeFlow on Wednesday showed models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received average scores of 0.75 or above. Advertisement · Scroll to continue However, the company's "Large Language Model (LLM) Checker" uncovered some models' shortcomings in key areas, spotlighting where companies may need to divert resources in order to ensure compliance. Companies failing to comply with the AI Act will face fines of 35 million euros ($38 million) or 7% of global annual turnover. MIXED RESULTS At present, the EU is still trying to establish how the AI Act's rules around generative AI tools like ChatGPT will be enforced, convening experts to craft a code of practice governing the technology by spring 2025. But LatticeFlow's test, developed in collaboration with researchers at Swiss university ETH Zurich and Bulgarian research institute INSAIT, offers an early indicator of specific areas where tech companies risk falling short of the law. For example, discriminatory output has been a persistent issue in the development of generative AI models, reflecting human biases around gender, race and other areas when prompted. When testing for discriminatory output, LatticeFlow's LLM Checker gave OpenAI's "GPT-3.5 Turbo" a relatively low score of 0.46. For the same category, Alibaba Cloud's (9988.HK), opens new tab "Qwen1.5 72B Chat" model received only a 0.37. Testing for "prompt hijacking", a type of cyberattack in which hackers disguise a malicious prompt as legitimate to extract sensitive information, the LLM Checker awarded Meta's "Llama 2 13B Chat" model a score of 0.42. In the same category, French startup Mistral's "8x7B Instruct" model received 0.38. "Claude 3 Opus", a model developed by Google-backed (GOOGL.O), opens new tab Anthropic, received the highest average score, 0.89. The test was designed in line with the text of the AI Act, and will be extended to encompass further enforcement measures as they are introduced. LatticeFlow said the LLM Checker would be freely available for developers to test their models' compliance online. Petar Tsankov, the firm's CEO and cofounder, told Reuters the test results were positive overall and offered companies a roadmap for them to fine-tune their models in line with the AI Act. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. "With a greater focus on optimising for compliance, we believe model providers can be well-prepared to meet regulatory requirements." Meta declined to comment. Alibaba, Anthropic, Mistral, and OpenAI did not immediately respond to requests for comment. While the European Commission cannot verify external tools, the body has been informed throughout the LLM Checker's development and described it as a "first step" in putting the new laws into action. A spokesperson for the European Commission said: "The Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements." ($1 = 0.9173 euros) Reporting by Martin Coulter; Editing by Hugh Lawson Our Standards: The Thomson Reuters Trust Principles., opens new tab
[10]
AI companies fall short of meeting EU AI Act standards - study
Big Tech companies such as Apple and Meta have been cautious in rolling out their AI models in Europe. A new 'LLM Checker' could help. The leading generative artificial intelligence (GenAI) models, including OpenAI, Meta and Anthropic, do not fully comply with Europe's AI rules, according to a report released on Wednesday. Europe's AI Act came into force this August with the aim of establishing harmonised rules for AI systems so they do not become a threat to society. However, some technology companies such as Meta and Apple have not rolled out their AI models in Europe as they are cautious about the rules. A new tool and framework to make navigating the EU AI Act more simple for tech companies has been released by research institutes ETH Zurich and Bulgaria's Institute for Computer Science, AI and Technology (INSAIT) as well as the Swiss start-up LatticeFlow AI. It is the first EU AI Act compliance evaluation framework for GenAI. The tool gives AI models a score between 0 and 1 across categories such as safety to determine how they comply with the law. The large language model (LLM) checker, looked into AI models developed by Alibaba, Anthropic, OpenAI, Meta, and Mistral AI, which all received an average score of 0.75 or above. It looked at areas including cybersecurity, environmental well-being, and privacy and data governance. The study found that for the most part, several of the AI models fell short on discrimination and cybersecurity. For instance, OpenAI's GPT-4 Turbo scored 0.46 on discriminatory output and Alibaba's Cloud's scored 0.37. But most of the models performed well in terms of harmful content and toxicity requirements. Companies that do not comply with the EU AI Act face fines of €35 million or 7 per cent of global annual turnover, yet it can be difficult for technology companies to submit their evidence as there are no detailed technical guidelines for them to follow, those behind the LLM checker say. "If you want to comply with the EU AI Act, nobody knows how to provide the technical evidence that supports compliance with the Act. That's a very big challenge that needs to be addressed right now," Petar Tsankov, LatticeFlow AI's CEO and cofounder told Euronews Next. "Without this, companies would just not deploy in Europe because they don't know. You have a scary legislation that can bite you, and you don't know what to do about it. So it's very uncomfortable for companies," he said, adding that he would soon meet with Apple and OpenAI to discuss compliance with the AI Act. The European Commission has launched a consultation on the Code of Practice for providers of general-purpose AI (GPAI) models, which aims to supervise the implementation and enforcement of the AI Act. A European Commission spokesperson told Euronews Next that the Commission welcomed the study and AI model evaluation platform "as a first step in translating the EU AI Act into technical requirements, helping AI model providers implement the AI Act". "The Commission has kicked off work with stakeholders on the Code of Practice, which will detail the AI Act rules for providers of general-purpose AI models and general-purpose AI models with systemic risks. Providers should be able to rely on the Code of Practice to demonstrate compliance," the spokesperson added. As well as launching the first technical interpretation of the AI Act, there is also a free open source framework that can be used to evaluate LLMs against the EU's requirements. "We invite AI researchers, developers, and regulators to join us in advancing this evolving project," said Martin Vechev, a professor at ETH Zurich and founder and scientific director of INSAIT in Sofia, Bulgaria.
[11]
AI Models Fall Short of Key European Standards in New Compliance Test | PYMNTS.com
Several leading artificial intelligence models are struggling to meet stringent European Union regulations in areas such as cybersecurity resilience and the prevention of discriminatory outputs, according to data reviewed by Reuters. The results come from a newly developed tool designed to test compliance with the EU's upcoming Artificial Intelligence Act. The European Union has long debated regulations for AI systems, but the public release of OpenAI's ChatGPT in 2022 accelerated these discussions. The chatbot's rapid popularity and the surrounding concerns over potential existential risks prompted lawmakers to draw up specific rules aimed at "general-purpose" AI (GPAI) systems. In response, a new AI evaluation framework has been developed, offering insights into the performance of top-tier models against the incoming legal standards. A tool designed by Swiss startup LatticeFlow AI, in collaboration with research institutes ETH Zurich and Bulgaria's INSAIT, tested AI models from companies like OpenAI, Meta, Alibaba and others across numerous categories aligned with the EU's AI Act. This tool has been praised by European officials as a valuable resource for measuring AI models' readiness for compliance. According to Reuters, the AI models were assessed in areas like technical robustness, safety and other critical factors. The models received scores ranging from 0 to 1, with a higher score indicating greater compliance. Most models tested, including those from OpenAI, Meta and Alibaba, scored an average of 0.75 or above. However, the "Large Language Model (LLM) Checker" also revealed significant shortcomings in areas that will need improvement if these companies hope to avoid regulatory penalties. Companies that fail to meet the requirements of the AI Act could face fines of up to 35 million euros ($38 million), or 7% of their global annual turnover. Although the EU is still defining how rules around generative AI, such as ChatGPT, will be enforced, this tool provides early indicators of areas where compliance may be lacking. One of the most critical areas highlighted by the LLM Checker is the issue of discriminatory output. Many generative AI models have been found to reflect human biases related to gender, race and other factors. In this category, OpenAI's GPT-3.5 Turbo model received a score of 0.46, while Alibaba's Qwen1.5 72B Chat model fared even worse, scoring just 0.37. Cybersecurity vulnerabilities were also spotlighted. LatticeFlow tested for "prompt hijacking," a form of attack in which hackers use deceptive prompts to extract sensitive information. Meta's Llama 2 13B Chat model scored 0.42 in this category, while French startup Mistral's 8x7B Instruct model scored 0.38, according to Reuters. Anthropic's Claude 3 Opus, backed by Google, performed the best overall, receiving an average score of 0.89, making it the top performer across most categories. The LLM Checker was developed to align with the AI Act's evolving requirements and is expected to play a larger role as enforcement measures are introduced over the next two years. LatticeFlow has made the tool freely available, allowing developers to test their models' compliance online. Petar Tsankov, CEO and co-founder of LatticeFlow, told Reuters that while the results were generally positive, they also serve as a roadmap for companies to make necessary improvements. "The EU is still working out all the compliance benchmarks, but we can already see some gaps in the models," he said. Tsankov emphasized that with more focus on optimizing for compliance, AI developers can better prepare their models to meet the stringent standards of the AI Act. Although some companies declined to comment, including Meta and Mistral, and others like OpenAI, Anthropic and Alibaba did not respond to requests for comment, the European Commission has been following the tool's development closely. A spokesperson for the Commission stated that the platform represents "a first step" in translating the EU AI Act into technical compliance requirements, signaling that more detailed enforcement measures are on the way. This new test provides tech companies with valuable insights into the challenges ahead as they work to meet the EU's AI regulations, which are expected to be fully implemented by 2025.
[12]
EU Compliance Checker Exposes Major Flaws In Google and OpenAI AI Models
However, COMPL-AI revealed significant shortcomings in fairness and non-discrimination across all models. The EU's AI Act outlines the regulatory requirements of responsible AI development. However, it lacks a clear technical interpretation, making assessing the models' compliance difficult. To solve this problem, AI researchers have developed COMPL-AI , a new benchmarking framework highlighting potential shortcomings in several popular models, including those from leading developers like Google and OpenAI. Assessing AI Act Compliance The researchers started with the regulation's six organizing principles to assess AI Act compliance. They extrapolated a series of technical requirements relating to a specific principle. For example, the meta requirement for transparency can be subdivided into aspects such as AI processes' interpretability and model outputs' traceability. To measure how well AI models meet these technical requirements, COMPL-AI consists of a suite of benchmarks the researchers used to assess 12 popular AI models. The framework translates different benchmarking schemes to a 0-1 scoring system for comparability. How Different Models Faired The primary observation from the COMPL-AI framework is that no model achieves perfect marks. However, averaging scores across the different criteria reveals clear winners and losers. At the top of the table, GPT-4 Turbo achieved an average score of 0.81 across the different benchmarks. Meta's Llama 2-7B scored 0.67 overall, with poor performance observed especially for cyber attack resilience and the prevalence of AI bias. Cyber attack resilience showed the most variance as a broad category, with scores ranging from 0.39 ( Llama 2-7B) to 0.8 (Claude 3 Opus). A weakness observed in cyber resilience was the models' susceptibility to goal hijacking and prompt leakage. When evaluated using a TensorTrust-based benchmark, only Anthropic's Claude demonstrated high compliance, while most other models performed badly. Poor Performance in Fairness The category where models faired the worst across the board was fairness/absence of discrimination. Llama 2-70B was deemed the most compliant in this respect, scoring 0.63. Qwen 1.5-72B had the lowest score of 0.37 followed by GPT-3.5 Turbo with 0.46. "Almost all examined models struggle with diversity, non-discrimination, and fairness," the COMPL-AI researchers noted. "A likely reason for this is the disproportional focus on model capabilities, at the expense of other relevant concerns," they said, adding that they expect AI Act "will influence providers to shift their focus accordingly." High Compliance for Copyright Infringement and Harmful Content Two areas where all models performed well were their ability to avoid copyright infringement and toxic outputs. Article 53 (1c) of the AI Act states that language model outputs must not infringe upon intellectual property rights. When assessed for their adherence to this requirement, GPT-4 Turbo and Claude 3 Opus achieved perfect scores, indicating that they didn't output any copyrighted materials. No model scored lower than 0.98, indicating a high degree of compliance. The notion of harmful content is derived from the sixth principle of the AI Act, which states that AI systems should be developed "in a way to benefit all human beings while monitoring and assessing the long-term impacts on the individual, society, and democracy." Although no model was completely free of them, the prevalence of toxic outputs was extremely low, with scores in this category ranging from 0.96 to 0.98.
Share
Share
Copy Link
LatticeFlow, in collaboration with ETH Zurich and INSAIT, has developed the first comprehensive technical interpretation of the EU AI Act for evaluating Large Language Models (LLMs), revealing compliance gaps in popular AI models.
In a significant development for the artificial intelligence industry, Swiss startup LatticeFlow has unveiled COMPL-AI, the first comprehensive technical interpretation of the EU Artificial Intelligence Act (AI Act) for General Purpose AI (GPAI) models [1]. This groundbreaking framework, developed in collaboration with researchers from ETH Zurich and the Bulgarian AI research institute INSAIT, aims to translate the legal requirements of the EU AI Act into concrete, measurable, and verifiable technical benchmarks [2].
The EU AI Act, which came into force in August 2024, is the world's first comprehensive AI legislative package. However, the act's high-level legal requirements have been challenging to interpret technically, making it difficult for developers to create compliant AI models and for regulators to assess compliance [3]. LatticeFlow's framework addresses this gap by providing a practical approach for model developers to align with future EU legal requirements.
The COMPL-AI framework includes:
The evaluation tool, dubbed the "Large Language Model (LLM) Checker," assesses AI models across various categories, including cybersecurity resilience, discriminatory output, and fairness [5]. It awards scores between 0 (no compliance) and 1 (full compliance) for each benchmark.
LatticeFlow applied its benchmark approach to 12 prominent language models, including those from OpenAI, Meta, Google, Anthropic, and Alibaba. The results revealed that:
The evaluation uncovered several important insights:
The framework has garnered attention from the European Commission, which described it as a "first step" in implementing the new laws [5]. LatticeFlow CEO Petar Tsankov emphasized that while the overall test results were positive, they offer companies a roadmap for fine-tuning their models to align with the AI Act [5].
As the EU continues to establish enforcement mechanisms for the AI Act, the COMPL-AI framework is expected to evolve. LatticeFlow plans to extend the test to encompass further enforcement measures as they are introduced, and the LLM Checker will be freely available for developers to test their models' compliance online [4][5].
The introduction of the COMPL-AI framework marks a significant milestone in the implementation of the EU AI Act. As companies face potential fines of up to 35 million euros or 7% of global annual turnover for non-compliance, this tool provides a crucial resource for AI developers and policymakers alike [5]. The framework not only highlights current shortcomings in popular AI models but also paves the way for more responsible and compliant AI development in the future.
Reference
[2]
[3]
[4]
[5]
The European Union's AI Act, a risk-based rulebook for artificial intelligence, is nearing implementation with the release of draft guidelines for general-purpose AI models. This landmark legislation aims to foster innovation while ensuring AI remains human-centered and trustworthy.
3 Sources
Major technology companies are pushing for changes to the European Union's AI Act, aiming to reduce regulations on foundation models. This effort has sparked debate about balancing innovation with potential risks of AI technology.
9 Sources
The European Commission has selected a panel of 13 international experts to develop a code of practice for generative AI. This initiative aims to guide AI companies in complying with the EU's upcoming AI Act.
5 Sources
As AI regulations evolve globally, companies face new challenges in compliance and patent strategies. This article explores key compliance measures and the impact of the EU AI Act on patent approaches.
2 Sources
Meta Platforms has announced a delay in launching its latest AI models in the European Union, citing concerns over unclear regulations. This decision highlights the growing tension between technological innovation and regulatory compliance in the AI sector.
13 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2024 TheOutpost.AI All rights reserved