Microsoft's MDASH AI system discovers 16 Windows vulnerabilities using 100+ specialized agents

Reviewed by Nidhi Govil


Microsoft unveiled MDASH, a multi-model agentic system that orchestrates over 100 specialized AI agents to find software vulnerabilities at enterprise scale. The system discovered 16 Windows flaws, including four critical remote code execution bugs, and topped the CyberGym benchmark with an 88.45% score, surpassing Anthropic's Mythos and OpenAI's GPT-5.5.

Microsoft's MDASH Redefines AI Cybersecurity at Production Scale

Microsoft has introduced MDASH, a multi-model agentic system that marks a significant shift in how AI cybersecurity tools discover and validate software vulnerabilities. Short for multi-model agentic scanning harness, MDASH orchestrates more than 100 specialized AI agents across multiple frontier and distilled models to autonomously identify, debate, and prove exploitable defects in complex codebases such as Windows. The system has already demonstrated production-grade capability by uncovering 16 Windows vulnerabilities fixed in this month's Patch Tuesday release, including four critical remote code execution flaws in components such as the Windows kernel TCP/IP stack and the IKEv2 service.

Source: Hacker News


The strategic implication is clear: AI vulnerability discovery has crossed from research curiosity into enterprise-scale defense, with Microsoft telling customers to expect larger Patch Tuesday updates going forward as AI accelerates the pace of finding flaws. Unlike single-model approaches from competitors, Microsoft's MDASH achieves what the company calls a "durable advantage" by efficiently leveraging multiple models rather than relying on any single one.

MDASH Outperforms Anthropic and OpenAI on CyberGym Benchmark

Source: GeekWire


Microsoft's MDASH scored 88.45% on the CyberGym benchmark, a test developed by UC Berkeley researchers that measures how well AI systems can reproduce real-world vulnerabilities across 1,507 tasks drawn from 188 open-source software projects. This performance placed it ahead of Anthropic's cybersecurity-focused Claude Mythos Preview, which scored 83.1%, and OpenAI's GPT-5.5, which achieved 81.8%. The British government's AISI, which evaluates AI models, confirmed that both Mythos and GPT-5.5 showed progress well above previous trends on cybersecurity testing, while separately, XBOW released data suggesting frontier models have taken a major step forward in vulnerability discovery.

The benchmark results highlight a key architectural difference: while Mythos is a single AI model running inside an agent framework, MDASH operates as a structured pipeline that ingests codebases and produces validated findings through specialized agent roles at each stage. Microsoft emphasizes that no single model excels at every stage, which is why the AI-powered security system runs a configurable panel of models.

How the Multi-Model Agentic System Discovers and Validates Flaws

Source: SiliconANGLE


MDASH operates through a structured pipeline of prepare, scan, validate, dedup, and prove stages, with specialized AI agents constructed from past CVEs and their patches. The system starts by analyzing source code to build a threat model and attack surface, then runs specialized "auditor" agents over candidate code paths to flag potential issues. A second set of "debater" agents validates the findings, with disagreement between models serving as a signal that increases a finding's credibility.
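The auditor/debater flow described above can be sketched as a toy two-stage pipeline. The class and function names below are illustrative assumptions for the sketch, not MDASH's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A candidate issue flagged on a code path, with votes from each stage."""
    path: str
    issue: str
    auditor_votes: int = 0
    debater_votes: int = 0

def audit(code_paths, auditors):
    """Stage one: each 'auditor' model reviews every candidate code path
    and flags potential issues; identical flags are merged into one finding."""
    findings = {}
    for path in code_paths:
        for auditor in auditors:
            issue = auditor(path)
            if issue:
                f = findings.setdefault((path, issue), Finding(path, issue))
                f.auditor_votes += 1
    return list(findings.values())

def debate(findings, debaters):
    """Stage two: an independent panel re-examines each finding; only
    findings at least one debater can defend survive to the prove stage."""
    for f in findings:
        f.debater_votes = sum(1 for d in debaters if d(f))
    return [f for f in findings if f.debater_votes > 0]
```

In a real system the auditors and debaters would be LLM calls with different underlying models; here they are plain callables so the staging logic is visible.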

State-of-the-art models handle heavy reasoning, distilled models act as cost-effective validators for high-volume passes, and a separate frontier model provides an independent counterpoint. Domain plugins inject context that foundation models cannot infer on their own, including kernel calling conventions, lock invariants, and interprocess communication trust boundaries.
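A configurable panel of this shape, with domain plugins feeding extra context into prompts, might look like the following sketch. The role names, model identifiers, and plugin keys are invented for illustration and are not MDASH's real schema:

```python
# Hypothetical panel configuration: one heavy reasoner, cheap distilled
# validators for high-volume passes, and a second frontier model as an
# independent counterpoint.
PANEL = {
    "reasoner":     {"model": "frontier-a"},
    "validators":   [{"model": "distilled-1"}, {"model": "distilled-2"}],
    "counterpoint": {"model": "frontier-b"},
}

# Domain plugins supply facts a foundation model cannot infer from source
# alone (the article names calling conventions, lock invariants, and IPC
# trust boundaries as examples).
DOMAIN_PLUGINS = {
    "kernel": ["calling-conventions", "lock-invariants"],
    "ipc":    ["trust-boundaries"],
}

def prompt_context(component):
    """Collect plugin-supplied facts to prepend to a model's prompt for
    the given component; unknown components get no extra context."""
    return DOMAIN_PLUGINS.get(component, [])
```

The point of the separation is that the panel and the plugins can be swapped per target codebase without touching the pipeline stages.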

Critical Windows Vulnerabilities Discovered Include Remote Code Execution Flaws

Among the 16 Windows vulnerabilities discovered by MDASH, four were rated critical, including CVE-2026-33827, a remote unauthenticated use-after-free in tcpip.sys triggered by crafted IPv4 packets, and CVE-2026-33824, a double-free in the IKEv2 service reachable over UDP port 500 that yields code execution as LocalSystem. The flaws spanned the Windows TCP/IP stack, the IKEEXT IPsec service, HTTP.sys, Netlogon, DNS resolution, and the Telnet client, comprising ten kernel-mode and six user-mode vulnerabilities, most reachable from a network position without credentials.

These were not simple bugs that a single-pass scanner would typically surface. The tcpip.sys flaw involved a reference-counted Path object whose ownership was dropped before later reuse, with three independent concurrent free paths in play, while the IKEEXT double-free spanned six source files. In internal testing, MDASH identified all 21 planted vulnerabilities in a private test driver called StorageDrive with zero false positives, and recorded 96% recall on clfs.sys and 100% on tcpip.sys against five years of confirmed Microsoft Security Response Center cases.
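The internal-testing figures are standard detection metrics. A minimal sketch of how recall and false positives are computed, using toy case IDs rather than the real MSRC data:

```python
def recall(found, known):
    """Fraction of known vulnerabilities the scanner rediscovered:
    true positives divided by all known cases."""
    return len(found & known) / len(known)

def false_positives(found, known):
    """Reported findings with no matching known vulnerability."""
    return len(found - known)

# Toy run mirroring the article's shape: 21 planted bugs, all found,
# nothing spurious reported.
known = {f"CASE-{i}" for i in range(21)}
found = set(known)
assert recall(found, known) == 1.0        # 21/21, i.e. 100% recall
assert false_positives(found, known) == 0
```

A 96% recall figure, by the same arithmetic, means the scanner rediscovered 96 of every 100 historically confirmed cases.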

The Arms Race Between AI Defense and Offensive Hacking Tools

The introduction of MDASH highlights growing concerns about AI's dual-use nature in cybersecurity: the same capabilities that let defenders find software vulnerabilities can let attackers discover them for exploitation. Microsoft acknowledges that MDASH "can approximate professional offensive researchers," which is why the company is limiting access to a private preview for select enterprise customers who apply.

Microsoft's security engineering teams have been using MDASH internally, alongside a small set of customers, as part of a limited private preview. The company, which has faced persistent criticism over security lapses, is betting that multiple models can discover vulnerabilities at a pace individual models can't match. Taesoo Kim, vice president of agentic security at Microsoft, stated that "the durable advantage lies in the agentic system around the model rather than any single model itself." Several members of the Autonomous Code Security team came from Team Atlanta, which won first place in the $20 million DARPA AI Cyber Challenge by building an autonomous cyber-reasoning system.


TheOutpost.ai

© 2026 TheOutpost.AI All rights reserved