Cisco finds no closed frontier AI model safe from multi-turn attacks across major providers

Reviewed by Nidhi Govil

2 Sources

Cisco's AI Threat Research team tested 15 proprietary flagship models from OpenAI, Anthropic, Google, Amazon, and xAI, revealing multi-turn attack success rates between 7.9% and 88.3%. The study exposes how single-turn safety benchmarks fail to capture real-world adversarial behavior, where attackers iterate and adapt across conversations. Even Anthropic's Claude family, the strongest single-turn performers, saw attack success rates rise from between 2.2% and 3.6% to between 11.2% and 16.2% under iterative pressure.

Cisco Reveals Critical Gap in Frontier Model Safety Testing

A comprehensive evaluation by Cisco's AI Threat Research team has exposed a fundamental vulnerability across the closed frontier AI model landscape. Testing 15 proprietary flagship models from OpenAI, Anthropic, Google, Amazon, and xAI, the research found that multi-turn attacks achieved success rates ranging from 7.89% to 88.30%, compared to single-turn attacks that registered 2.19% to 64.91% on the same models [1][2]. The findings challenge the dominant safety benchmarks that inform model cards and procurement decisions across the industry, which assume a single prompt and response adequately characterize model behavior under adversarial attack.

Single-Turn Benchmarks Mask Real-World LLM Vulnerabilities

The evaluation demonstrates that single-turn attack success rates cannot serve as reliable proxies for what happens when attackers adapt across conversations. OpenAI's GPT-5.4 moved from 2.74% single-turn to 24.68% multi-turn, a roughly ninefold increase [2]. Google's Gemini 3 Pro shifted from 18.10% to 73.35%, while xAI's Grok 4.1 Fast in its non-reasoning configuration climbed from 34.2% to 88.30% [1]. Even Anthropic's Claude family, which posted the lowest single-turn attack success rates at 2.19% to 3.64%, reached 11.16% to 16.20% under iterative pressure [1]. The two regimes did not produce the same model ordering, meaning models that appeared strong on single-turn benchmarks did not necessarily maintain that performance when attackers could keep talking.
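The arithmetic behind these comparisons can be checked with a short sketch. It uses only attack success rates quoted in this article; the function names are illustrative and are not taken from Cisco's tooling, and the four models shown are just the ones for which the article gives both a single-turn and a multi-turn figure.

```python
# Attack success rates (percent) quoted in this article, as
# (single-turn, multi-turn). Function names are illustrative only.
RATES = {
    "GPT-5.4": (2.74, 24.68),
    "Gemini 3 Pro": (18.10, 73.35),
    "Grok 4.1 Fast (non-reasoning)": (34.2, 88.30),
    "Nova 2 Lite": (34.05, 7.89),
}


def delta_pp(single, multi):
    """Cross-regime gap in percentage points (multi-turn minus single-turn)."""
    return round(multi - single, 2)


def multiplier(single, multi):
    """How many times larger the multi-turn rate is than the single-turn rate."""
    return round(multi / single, 2)


def ranking(index):
    """Model names ordered from lowest to highest rate.

    index 0 ranks by single-turn rate, index 1 by multi-turn rate.
    """
    return sorted(RATES, key=lambda name: RATES[name][index])


if __name__ == "__main__":
    for name, (single, multi) in RATES.items():
        print(f"{name}: {delta_pp(single, multi):+.2f} pp, "
              f"x{multiplier(single, multi):.2f}")
    print("single-turn order:", ranking(0))
    print("multi-turn order: ", ranking(1))
```

On these four models the two orderings differ: GPT-5.4 has the lowest single-turn rate, while Nova 2 Lite has the lowest multi-turn rate.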

Source: SiliconANGLE

Multi-Turn Adversarial Attacks Reflect Actual Threat Landscape

The research emphasizes that iterative attack scenarios matter because they reflect how real adversaries operate. Attackers reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually, behaviors that single-turn benchmarks cannot capture [1]. Cisco's evaluation drew on 30,090 single-turn prompts and 6,986 multi-turn attacks distributed across 1,456 conversations, all scored under the Cisco Integrated AI Security and Safety Framework taxonomy [2]. Strategy families included role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition and reassembly, and crescendo-style incremental escalation.

Deployment-Time Configurations Impact Security Posture

A significant finding concerns how deployment-time configurations affect adversarial success rates. The same Grok 4.1 Fast model dropped from an 88.3% multi-turn attack success rate to 43.5% once reasoning mode was enabled, a swing not captured by any public benchmark or model card the researchers reviewed [2]. Cisco called on model providers to document the safety effects of configuration flags such as reasoning modes, system-prompt adherence settings, temperature, and guardrail tiers alongside the capability benchmarks they already publish.

Amazon Nova 2 Lite Shows Inverted Risk Profile

Amazon's Nova 2 Lite produced the cleanest inversion in the cohort, with a relatively high single-turn rate of 34.05% but the lowest multi-turn rate at 7.89% [1][2]. This result illustrates why single-turn scores alone cannot be treated as proxies for adversarial robustness, and it presents governance risks for business decisions made on the basis of published single-turn scores. Cross-regime deltas ranged from -34.74 percentage points to +55.25 percentage points, with eight of 15 models showing a gap of more than 15 percentage points in one direction or the other [1].

Pattern Extends Beyond Proprietary Models

This study follows Cisco's earlier "Death by a Thousand Prompts" assessment of eight open-weight LLMs, which found multi-turn success rates two to 10 times higher than single-turn baselines, reaching 92.78% against Mistral Large-2 [1][2]. The pattern documented in open models holds in closed ones, suggesting multi-turn vulnerability is a structural property of the current frontier rather than an artifact of open-weight alignment choices or capability-first development.

Compliance and AI Risk Management Frameworks at Stake

The findings carry compliance implications for organizations deploying frontier models. NIST's AI Risk Management Framework, its draft Cyber AI Profile, and Article 15 of the European Union AI Act all require adversarial robustness testing without specifying how many turns coverage should include or which attack strategies should be in scope [2]. Cisco recommends organizations ask labs to publish attack success rates broken down by strategy family on every model release, gate deployments on regressions in top procedures and content categories with a three-percentage-point threshold, and flag any model with a cross-regime gap larger than 15 percentage points for manual review. In the tested cohort, that last rule alone would surface eight of 15 models, including GPT-5.4, Gemini 3 Pro, both Grok configurations, and all three Nova variants [2]. The findings inform Cisco's AI Defense product and the LLM Security Leaderboard, which publishes adversarial evaluation signals against leading models.
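As a rough illustration, the two numeric rules in Cisco's recommendations could be expressed as simple checks. This is a minimal sketch, not Cisco's implementation: the function names are hypothetical, the per-category rates in the usage example are invented placeholders, and reading the three-point rule as "block when any tracked category's attack success rate rises by more than three percentage points between releases" is an assumption about how the article's wording would be applied.

```python
GAP_THRESHOLD_PP = 15.0        # cross-regime gap that triggers manual review
REGRESSION_THRESHOLD_PP = 3.0  # per-category rise that blocks a deployment


def needs_manual_review(single_asr, multi_asr, threshold=GAP_THRESHOLD_PP):
    """True when single-turn and multi-turn rates differ by more than
    `threshold` percentage points, in either direction."""
    return abs(multi_asr - single_asr) > threshold


def regressed_categories(baseline, candidate,
                         threshold=REGRESSION_THRESHOLD_PP):
    """Categories whose attack success rate rose by more than `threshold`
    percentage points from the baseline release to the candidate release.
    Only categories present in both releases are compared."""
    shared = baseline.keys() & candidate.keys()
    return sorted(
        cat for cat in shared if candidate[cat] - baseline[cat] > threshold
    )


if __name__ == "__main__":
    # Rates quoted in the article (percent).
    print(needs_manual_review(2.74, 24.68))   # GPT-5.4
    print(needs_manual_review(34.05, 7.89))   # Nova 2 Lite

    # Hypothetical per-strategy rates, for illustration only.
    baseline = {"role_play": 10.0, "decomposition": 20.0, "crescendo": 30.0}
    candidate = {"role_play": 12.0, "decomposition": 24.5, "crescendo": 29.0}
    print(regressed_categories(baseline, candidate))
```

Using the absolute value in the gap check means an inverted profile such as Nova 2 Lite's is flagged just like a model whose rate climbs under multi-turn pressure.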

© 2026 TheOutpost.AI All rights reserved