In short
- EVMbench evaluates AI agents on 120 real-world Ethereum smart contract vulnerabilities.
- The tool tests detection, patching, and exploitation across three distinct modes.
- GPT-5.3-Codex achieved a 72.2% success rate in exploit mode testing.
ChatGPT maker OpenAI and crypto-focused investment firm Paradigm have introduced EVMbench, a tool to help improve Ethereum Virtual Machine smart contract security.
EVMbench is designed to evaluate AI agents’ ability to detect, patch, and exploit high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts.
Smart contracts are the heart of the Ethereum network, holding the code that powers everything from decentralized finance protocols to token launches. The weekly number of smart contracts deployed on Ethereum reached an all-time high of 1.7 million in November 2025, with 669,500 deployed last week alone, according to Token Terminal.
EVMbench draws on 120 curated vulnerabilities from 40 audits, most of them sourced from open audit competitions such as Code4rena, according to an OpenAI blog post. It also includes scenarios from the security auditing process for Tempo, Stripe’s purpose-built layer-1 blockchain focused on high-throughput, low-cost stablecoin payments.
Payments giant Stripe launched the public testnet for Tempo in December, saying at the time that it was being built with input from Visa, Shopify, and OpenAI, among others.
The goal is to ground testing in economically meaningful, real-world code, particularly as AI-driven stablecoin payments expand, the firm added.
Introducing EVMbench, a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
EVMbench is meant to evaluate AI models across three modes: detect, patch, and exploit. In “detect,” agents audit repositories and are scored on their recall of ground-truth vulnerabilities. In “patch,” agents must eliminate vulnerabilities without breaking intended functionality. Finally, in the “exploit” phase, agents attempt end-to-end fund-draining attacks in a sandboxed blockchain environment, with grading performed through deterministic transaction replay.
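To make the recall-based scoring concrete, here is a minimal sketch of how a detect-mode score could be computed. It is an illustration only, not EVMbench's actual grader: the function name `detect_recall` and the finding IDs are hypothetical, and it assumes reported findings have already been matched to ground-truth entries.

```python
def detect_recall(reported, ground_truth):
    """Return the fraction of ground-truth vulnerabilities the agent reported.

    Assumption: both arguments are collections of finding IDs that have
    already been matched to each other; real graders need a matching step.
    """
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must not be empty")
    # Findings outside the ground truth do not raise or lower recall.
    return len(truth & set(reported)) / len(truth)


# Hypothetical audit with three known high-severity issues.
ground_truth = ["H-01", "H-02", "H-03"]
# The agent finds two of them and reports one unrelated issue.
reported = ["H-01", "H-03", "X-99"]

print(f"recall = {detect_recall(reported, ground_truth):.3f}")
```

Because recall only counts how many known issues were found, an agent that audits a repository less thoroughly scores lower even if everything it reports is correct.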
In exploit mode, GPT-5.3-Codex running via OpenAI’s Codex CLI achieved a score of 72.2%, compared to 31.9% for GPT-5, which was released six months earlier. Performance was weaker in the detect and patch tasks, where agents sometimes failed to audit exhaustively or struggled to preserve full contract functionality.
The ChatGPT maker’s researchers cautioned that EVMbench does not fully capture real-world security complexity. Still, they added that measuring AI performance in economically relevant environments is critical as models become powerful tools for both attackers and defenders.
Sam Altman’s OpenAI and Ethereum co-founder Vitalik Buterin have previously been at odds over the pace of AI development.
In January 2025, Altman said that his firm was “confident we know how to build AGI as we have traditionally understood it.” But Buterin has advocated that AI systems should include a “soft pause” capability that could temporarily restrict industrial-scale AI operations if warning signs emerge.
