3 Sources
[1]
Can AI Agents Boost Ethereum Security? OpenAI and Paradigm Created a Testing Ground - Decrypt
ChatGPT maker OpenAI and crypto-focused investment firm Paradigm have introduced EVMbench, a tool to help improve Ethereum Virtual Machine smart contract security. EVMbench is designed to evaluate AI agents' ability to detect, patch, and exploit high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts. Smart contracts are the heart of the Ethereum network, holding the code that powers everything from decentralized finance protocols to token launches. The weekly number of smart contracts deployed on Ethereum reached an all-time high of 1.7 million in November 2025, with 669,500 deployed last week alone, according to Token Terminal.

EVMbench draws on 120 curated vulnerabilities from 40 audits, most sourced from open audit competitions such as Code4rena, according to an OpenAI blog post. It also includes scenarios from the security auditing process for Tempo, Stripe's purpose-built layer-1 blockchain focused on high-throughput, low-cost stablecoin payments. Payments giant Stripe launched the public testnet for Tempo in December, saying at the time that it was being built with input from Visa, Shopify, and OpenAI, among others. The goal is to ground testing in economically meaningful, real-world code, particularly as AI-driven stablecoin payments expand, the firm added.

EVMbench evaluates AI models across three modes: detect, patch, and exploit. In detect mode, agents audit repositories and are scored on their recall of ground-truth vulnerabilities. In patch mode, agents must eliminate vulnerabilities without breaking intended functionality. Finally, in exploit mode, agents attempt end-to-end fund-draining attacks in a sandboxed blockchain environment, with grading performed via deterministic transaction replay. In exploit mode, GPT-5.3-Codex running via OpenAI's Codex CLI achieved a score of 72.2%, compared to 31.9% for GPT-5, which was released six months earlier.
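The detect-mode recall scoring described above can be sketched in a few lines of Python. This is a hypothetical illustration only: the `Finding` type, the file names, and the vulnerability classes are invented for the sketch and are not EVMbench's actual data format or API.

```python
# Hypothetical sketch of detect-mode scoring: an agent's reported findings
# are matched against ground-truth vulnerabilities and scored by recall.
# All names here (Finding, the contracts, the bug classes) are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    contract: str     # e.g. "Vault.sol"
    vuln_class: str   # e.g. "reentrancy"

def detect_recall(reported: set, ground_truth: set) -> float:
    """Fraction of ground-truth vulnerabilities the agent recovered."""
    if not ground_truth:
        return 1.0
    return len(reported & ground_truth) / len(ground_truth)

ground_truth = {
    Finding("Vault.sol", "reentrancy"),
    Finding("Auction.sol", "integer-overflow"),
    Finding("Bridge.sol", "missing-access-control"),
}
reported = {
    Finding("Vault.sol", "reentrancy"),
    Finding("Bridge.sol", "missing-access-control"),
    Finding("Token.sol", "unchecked-return"),  # false positive: ignored by recall
}

print(detect_recall(reported, ground_truth))  # 2 of 3 ground-truth bugs found
```

Note that recall alone ignores false positives; a benchmark that also paid out per-finding "audit rewards", as described in the sources, would need a separate precision or payout term.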
Performance was weaker in the detect and patch tasks, where agents sometimes failed to audit exhaustively or struggled to preserve full contract functionality. The ChatGPT maker's researchers cautioned that EVMbench does not fully capture real-world security complexity. Still, they added that measuring AI performance in economically relevant environments is critical as models become powerful tools for both attackers and defenders.

Sam Altman's OpenAI and Ethereum co-founder Vitalik Buterin have previously been at odds over the pace of AI development. In January 2025, Altman said that his firm was "confident we know how to build AGI as we have traditionally understood it." Buterin, by contrast, has argued that AI systems should include a "soft pause" capability that could temporarily restrict industrial-scale AI operations if warning signs emerge.
[2]
OpenAI Researches AI Agents Detecting Smart Contract Flaws
OpenAI said it is becoming increasingly important to evaluate the performance of AI agents in "economically meaningful environments" as their adoption grows. OpenAI has launched a new benchmark that evaluates how well different AI models detect, patch, and even exploit security vulnerabilities found in crypto smart contracts. OpenAI released the "EVMbench: Evaluating AI Agents on Smart Contract Security" paper on Wednesday, in collaboration with crypto investment firm Paradigm and crypto security firm OtterSec, to evaluate how much the AI agents could theoretically exploit from 120 smart contract vulnerabilities.

Anthropic's Claude Opus 4.6 came out on top with an average "detect award" of $37,824, followed by OpenAI's OC-GPT-5.2 and Google's Gemini 3 Pro at $31,623 and $25,112, respectively.

While AI agents are becoming increasingly efficient at handling basic tasks, OpenAI said it is becoming more important to evaluate their performance in "economically meaningful environments." "Smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders," the company said, adding: "We expect agentic stablecoin payments to grow, and help ground it in a domain of emerging practical importance."

Circle CEO Jeremy Allaire predicted on Jan. 22 that billions of AI agents will be transacting with stablecoins for everyday payments on behalf of users within five years, while former Binance boss Changpeng "CZ" Zhao also recently tipped that crypto would end up being the "native currency for AI agents." The need to test agentic AI performance in spotting security vulnerabilities comes as attackers stole $3.4 billion worth of crypto funds in 2025, a marginal increase from 2024.

EVMbench drew on 120 curated vulnerabilities from 40 smart contract audits, with most of them sourced from open-source audit competitions.
OpenAI said it hopes the benchmark will help track AI progress in spotting and mitigating smart contract vulnerabilities at scale.

In a post to X on Wednesday, Dragonfly's managing partner Haseeb Qureshi said crypto's promise of replacing property rights and legal contracts never materialized, not because the technology failed, but because it was never designed for human intuition. Qureshi said it still feels "terrifying" to sign large transactions, particularly with drainer wallets and other threats always present, whereas bank transfers rarely provoke the same fear. Instead, Qureshi believes the future of crypto transactions will be facilitated by AI-intermediated, self-driving wallets, which will take care of those threats and manage complex operations on behalf of users: "A technology often snaps into place once its complement finally arrives. GPS had to wait for the smartphone, TCP/IP had to wait for the browser. For crypto, we might just have found it in AI agents."
[3]
OpenAI Introduces Smart Contract Benchmark for AI Agents as AI and Crypto Converge
EVMbench uses 120 real flaws from 40 audits, including Code4rena and Tempo work. OpenAI has introduced a new smart contract security benchmark as AI agents gain stronger coding abilities in the crypto sector. Together with Paradigm, OpenAI said the benchmark, called EVMbench, tests how AI systems detect, patch, and exploit serious Ethereum contract bugs. The effort responds to growing financial risk, since smart contracts routinely secure over $100 billion in open-source crypto assets.

OpenAI Smart Contract Benchmark Targets Real Audit Vulnerabilities

In its release, OpenAI said EVMbench draws on 120 curated vulnerabilities collected from 40 professional smart contract audits. Notably, most of the issues came from open audit competitions, including Code4rena. OpenAI said the benchmark also includes vulnerability scenarios tied to security auditing work for the Tempo blockchain. Tempo is described as a purpose-built Layer-1 network designed for high-throughput, low-cost stablecoin payments; these scenarios extend the benchmark into payment-focused contract code. The company also said it expects agent-based stablecoin payment activity to grow.

To build the benchmark environments, OpenAI said it adapted existing exploit proof-of-concept tests and deployment scripts when available, and engineers manually wrote missing components when no scripts existed. OpenAI added that it ensured patch tasks remained exploitable while still being fixable without breaking compilation.

Detect, Patch, Exploit Modes Test AI Agents Under Pressure

OpenAI said EVMbench evaluates AI agents in three modes: detect, patch, and exploit. In detect mode, agents audit smart contract repositories and are scored on recall of confirmed vulnerabilities and audit rewards. In patch mode, agents must modify vulnerable contracts while keeping intended functionality intact.
Exploit mode focuses on full end-to-end fund-draining attacks in a sandboxed blockchain environment. The company said graders verify results using transaction replay and on-chain checks. To support reproducible evaluation, the company said it developed a Rust-based harness to deploy contracts and replay transactions deterministically. Notably, the exploit tasks run in an isolated local Anvil environment instead of on live crypto networks. It also said the vulnerabilities used in the benchmark are historical and publicly documented, and that the harness restricts unsafe RPC methods to limit abuse.

In exploit testing, OpenAI said GPT-5.3-Codex running via Codex CLI scored 72.2%, while the earlier GPT-5 model, released just over six months before, scored 31.9%. OpenAI also noted that detect recall and patch success remain below full coverage.

OpenAI Adds New Talent with Agent Hire

While OpenAI pushed EVMbench into public view, it also expanded its agent development team. Notably, it hired Peter Steinberger, founder of the viral open-source AI agent project OpenClaw, previously known as Clawdbot. Sam Altman confirmed on X that Steinberger will join OpenAI to lead work on the "next generation of personal agents." Meanwhile, Altman said OpenClaw will transition into a foundation model project supported by OpenAI. The open-source project will continue under that structure, according to the announcement. The hiring drew wide attention as OpenAI increases its focus on autonomous and personal AI agents.
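The grading idea behind deterministic transaction replay can be conveyed with a toy model. The real harness is Rust-based and replays transactions against a local Anvil chain; the pure-Python sketch below only illustrates the concept under invented assumptions: re-run a recorded transaction sequence from a fixed initial state, then check whether the attacker's balance strictly increased. The ledger model, the account names, and the trace are all hypothetical.

```python
# Conceptual model of exploit-mode grading by deterministic replay.
# Not EVMbench's harness: a toy ledger standing in for a sandboxed chain.

def replay(initial_balances: dict, txs: list) -> dict:
    """Deterministically apply (sender, recipient, amount) transfers in order."""
    balances = dict(initial_balances)  # never mutate the recorded initial state
    for sender, recipient, amount in txs:
        if balances.get(sender, 0) < amount:
            raise ValueError(f"{sender} cannot send {amount}")
        balances[sender] -= amount
        balances[recipient] = balances.get(recipient, 0) + amount
    return balances

def exploit_succeeded(initial: dict, txs: list, attacker: str) -> bool:
    """Grade: did the replayed trace strictly increase the attacker's balance?"""
    final = replay(initial, txs)
    return final.get(attacker, 0) > initial.get(attacker, 0)

initial = {"vault": 1_000, "attacker": 0}
# A recorded "fund-draining" trace: the vulnerable vault pays out to the attacker.
trace = [("vault", "attacker", 1_000)]

print(exploit_succeeded(initial, trace, "attacker"))  # True: vault was drained
```

Because the grader re-executes the trace from a fixed starting state, the same trace always yields the same verdict, which is the property deterministic replay buys in the real harness.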
OpenAI and crypto investment firm Paradigm unveiled EVMbench, a benchmark tool designed to evaluate how AI agents detect, patch, and exploit vulnerabilities in Ethereum smart contracts. Drawing on 120 real flaws from 40 audits, the tool tests models like GPT-5.3-Codex, which scored 72.2% in exploit mode, as billions in crypto assets remain at risk.
ChatGPT maker OpenAI has partnered with crypto investment firm Paradigm and security firm OtterSec to launch EVMbench, a smart contract security benchmark designed to evaluate how AI agents detect, patch, and exploit vulnerabilities in Ethereum smart contracts [1][2]. The tool arrives as smart contracts secure over $100 billion in crypto assets and attackers stole $3.4 billion worth of funds in 2025 [2][3]. EVMbench draws on 120 curated vulnerabilities from 40 professional audits, with most sourced from open audit competitions such as Code4rena [1][3]. The benchmark also includes scenarios from security auditing work for Tempo, Stripe's purpose-built layer-1 blockchain focused on high-throughput, low-cost stablecoin payments [1].
Source: CoinGape
OpenAI emphasized that measuring AI performance in economically meaningful environments has become critical as models evolve into powerful tools for both cyber attackers and defenders [1]. The weekly number of Ethereum smart contracts deployed reached an all-time high of 1.7 million in November 2025, with 669,500 deployed last week alone, according to Token Terminal [1]. EVMbench evaluates AI models across three distinct modes: detect, patch, and exploit [1][3]. In detect mode, agents audit repositories and receive scores based on their recall of confirmed smart contract vulnerabilities. In patch mode, agents must eliminate vulnerabilities without breaking intended functionality. Exploit mode tests agents on end-to-end fund-draining attacks in a sandboxed blockchain environment, with grading performed via deterministic transaction replay [1][3].
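The patch-mode criterion, eliminating the vulnerability without breaking intended functionality, can be illustrated with a toy grader. The bug, the functions, and the checks below are invented for illustration; EVMbench grades real Solidity contracts, not Python functions.

```python
# Toy illustration of patch-mode grading: a fix passes only if intended
# behavior still works AND the exploit path is closed. All names invented.

def withdraw_vulnerable(balances: dict, user: str, amount: int) -> int:
    # Bug: no balance check, so a user can withdraw more than they hold.
    balances[user] = balances.get(user, 0) - amount
    return amount

def withdraw_patched(balances: dict, user: str, amount: int) -> int:
    # Fix: reject withdrawals exceeding the user's balance.
    if balances.get(user, 0) < amount:
        raise ValueError("insufficient balance")
    balances[user] -= amount
    return amount

def grade_patch(withdraw) -> bool:
    """Pass iff normal use still works AND over-withdrawal is rejected."""
    balances = {"alice": 100}
    try:
        ok = withdraw(balances, "alice", 40) == 40 and balances["alice"] == 60
    except ValueError:
        return False  # patch broke intended functionality
    try:
        withdraw(balances, "alice", 1_000)  # exploit attempt
        return False  # exploit still possible
    except ValueError:
        return ok

print(grade_patch(withdraw_vulnerable))  # False: exploit still possible
print(grade_patch(withdraw_patched))     # True: fixed and still functional
```

The two-sided check mirrors why agents struggled in this mode: an over-aggressive patch fails the functionality leg, while a cosmetic one fails the exploit leg.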
Source: Cointelegraph
In exploit mode testing, GPT-5.3-Codex running via OpenAI's Codex CLI achieved a score of 72.2%, compared to 31.9% for GPT-5, which was released just six months earlier [1][3]. However, performance was weaker in the detect and patch tasks, where agents sometimes failed to audit exhaustively or struggled to preserve full contract functionality [1]. In separate testing focused on detection capabilities, Anthropic's Claude Opus 4.6 came out on top with an average detect award of $37,824, followed by OpenAI's OC-GPT-5.2 at $31,623 and Google's Gemini 3 Pro at $25,112 [2]. OpenAI researchers cautioned that EVMbench does not fully capture real-world security complexity, but the benchmark provides a foundation for tracking AI progress in spotting and mitigating vulnerabilities at scale [1][2].
OpenAI stated that smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders [2]. The company added that it expects agentic stablecoin payments to grow, grounding the benchmark in a domain of emerging practical importance [2]. Circle CEO Jeremy Allaire predicted on January 22 that billions of AI agents will be transacting with stablecoins for everyday payments on behalf of users within five years [2]. Former Binance boss Changpeng Zhao also recently suggested that crypto would become the native currency for AI agents [2]. Dragonfly's managing partner Haseeb Qureshi believes the future of crypto transactions will be facilitated by AI-intermediated, self-driving wallets that manage complex operations and threats on behalf of users [2].

As OpenAI pushed EVMbench into public view, the company also expanded its agent development team by hiring Peter Steinberger, founder of the viral open-source AI agent project OpenClaw, previously known as Clawdbot [3]. Sam Altman confirmed on X that Steinberger will join OpenAI to lead work on the next generation of personal agents [3]. OpenClaw will transition into a foundation model project supported by OpenAI, with the open-source project continuing under that structure [3]. The hiring signals OpenAI's increased focus on autonomous and next-generation personal AI agent capabilities as the convergence between AI and crypto accelerates.

Summarized by Navi