OpenAI and Paradigm Launch EVMbench to Test AI Agents on Ethereum Smart Contract Security

Reviewed by Nidhi Govil

OpenAI and crypto investment firm Paradigm unveiled EVMbench, a benchmark tool designed to evaluate how AI agents detect, patch, and exploit vulnerabilities in Ethereum smart contracts. Drawing on 120 real flaws from 40 audits, the tool tests models like GPT-5.3-Codex, which scored 72.2% in exploit mode, as billions in crypto assets remain at risk.

OpenAI and Paradigm Introduce EVMbench for Smart Contract Security

ChatGPT maker OpenAI has partnered with crypto investment firm Paradigm and security firm OtterSec to launch EVMbench, a smart contract security benchmark designed to evaluate how AI agents detect, patch, and exploit vulnerabilities in Ethereum smart contracts [1][2]. The tool arrives as smart contracts secure over $100 billion in crypto assets and attackers stole $3.4 billion worth of funds in 2025 [2][3]. EVMbench draws on 120 curated vulnerabilities from 40 professional audits, with most sourced from open audit competitions such as Code4rena [1][3]. The benchmark also includes scenarios from security auditing work for Tempo, Stripe's purpose-built layer-1 blockchain focused on high-throughput, low-cost stablecoin payments [1].

Source: CoinGape

Testing AI Performance in Economically Meaningful Environments

OpenAI emphasized that measuring AI performance in economically meaningful environments has become critical as models evolve into powerful tools for both cyber attackers and defenders [1]. The weekly number of Ethereum smart contracts deployed reached an all-time high of 1.7 million in November 2025, with 669,500 deployed last week alone, according to Token Terminal [1]. EVMbench evaluates AI models across three distinct modes: detect, patch, and exploit [1][3]. In detect mode, agents audit repositories and receive scores based on their recall of confirmed smart contract vulnerabilities. In patch mode, agents must eliminate vulnerabilities without breaking intended functionality. Exploit mode tests agents on end-to-end fund-draining attacks in a sandboxed blockchain environment, with grading performed via deterministic transaction replay [1][3].
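As a rough illustration of recall-based scoring in detect mode, the sketch below computes the share of confirmed vulnerabilities that an agent reported. It is a minimal sketch, not EVMbench's actual grader: the function name `detect_recall`, the finding identifiers, and the exact-match comparison are all assumptions made for this example, since the sources do not describe how reported findings are matched to confirmed ones.

```python
def detect_recall(reported: set[str], confirmed: set[str]) -> float:
    """Return the fraction of confirmed vulnerabilities that were reported.

    Hypothetical helper for illustration only; identifiers are compared by
    exact match, which is an assumption rather than EVMbench's real method.
    """
    if not confirmed:
        raise ValueError("confirmed set must not be empty")
    return len(reported & confirmed) / len(confirmed)


# Hypothetical audit: three confirmed findings. The agent reports two of
# them plus one finding that is not in the confirmed set.
confirmed = {"H-01", "H-02", "H-03"}
reported = {"H-01", "H-03", "X-99"}

score = detect_recall(reported, confirmed)
print(f"detect recall: {score:.3f}")
```

Note that under a pure recall metric the extra unconfirmed finding does not lower the score; only missed confirmed vulnerabilities do.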

Source: Cointelegraph

GPT-5.3-Codex Leads in Exploit Mode Performance

In exploit mode testing, GPT-5.3-Codex running via OpenAI's Codex CLI achieved a score of 72.2%, compared to 31.9% for GPT-5, which was released just six months earlier [1][3]. However, performance was weaker in the detect and patch tasks, where agents sometimes failed to audit exhaustively or struggled to preserve full contract functionality [1]. In separate testing focused on detection capabilities, Anthropic's Claude Opus 4.6 came out on top with an average detect award of $37,824, followed by OpenAI's OC-GPT-5.2 at $31,623 and Google's Gemini 3 Pro at $25,112 [2]. OpenAI researchers cautioned that EVMbench does not fully capture real-world security complexity, but the benchmark provides a foundation for tracking AI progress in spotting and mitigating vulnerabilities at scale [1][2].

AI for Cyber Defense and the Future of Stablecoin Payments

OpenAI stated that smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders [2]. The company added that it expects agentic stablecoin payments to grow, grounding the benchmark in a domain of emerging practical importance [2]. Circle CEO Jeremy Allaire predicted on January 22 that billions of AI agents will be transacting with stablecoins for everyday payments on behalf of users within five years [2]. Former Binance boss Changpeng Zhao also recently suggested that crypto would become the native currency for AI agents [2]. Dragonfly's managing partner Haseeb Qureshi believes the future of crypto transactions will be facilitated by AI-intermediated, self-driving wallets that manage complex operations and threats on behalf of users [2].

OpenAI Expands Agent Development Team

As OpenAI pushed EVMbench into public view, the company also expanded its agent development team by hiring Peter Steinberger, founder of the viral open-source AI agent project OpenClaw, previously known as Clawdbot [3]. Sam Altman confirmed on X that Steinberger will join OpenAI to lead work on the next generation of personal agents [3]. OpenClaw will move into a foundation supported by OpenAI, with the open-source project continuing under that structure [3]. The hiring signals OpenAI's increased focus on autonomous, next-generation personal AI agents as the convergence between AI and crypto accelerates.
