2 Sources
[1]
Proprietary Problems: No Frontier Model Is Multi-Turn Immune
The dominant safety benchmarks for frontier large language models (LLMs) share a structural assumption: that a single prompt and a single model response are enough to characterize how a model behaves under adversarial attack. These benchmarks inform model cards, safety reports, and procurement decisions across the industry, but they measure only one narrow slice of attacker behavior. In a paired-regime evaluation of 15 closed/proprietary flagship models from OpenAI, Anthropic, Google, Amazon, and xAI, we found that single-turn attack success rate (ASR) is not a reliable proxy for what happens when an attacker can adapt across turns. Multi-turn ASR ranged from 7.89% to 88.30% across the cohort, while single-turn ASR for the same models ranged from 2.19% to 64.91%. The two regimes do not produce the same model ordering, the same failure map, or the same tail-risk picture, and every model we tested exhibited non-trivial multi-turn ASR.
The full report extends our earlier assessment of eight open-weight LLMs, Death by a Thousand Prompts, where multi-turn attack success rates ran 2x to 10x higher than single-turn baselines. The pattern we documented in open models holds in closed ones, including the correlation between a lab's alignment philosophy and its models' performance against adversarial prompts. In both studies, models with wider single-to-multi-turn gaps tended to come from labs whose public communications emphasize capability advancement, while narrower gaps were more common among labs that publicly emphasize safety.
The evaluation is built on a fixed snapshot from our adversarial corpus: 30,090 single-turn prompts (2,006 per model) and 6,986 multi-turn attacks distributed across 1,456 conversations.
The 15 models we assessed cover recent flagship models from OpenAI (GPT-5.2 and the GPT-5.4 family), Anthropic (Claude Opus 4.5 and 4.6, Sonnet 4.5 and 4.6, Haiku 4.5), Google (Gemini 3 Pro), Amazon (Nova Lite, Nova Micro, Nova 2 Lite), and xAI (Grok 4.1 Fast in both reasoning and non-reasoning (NR) configurations). Each was tested under the same harness, on the same prompt banks, with the Cisco Integrated AI Security and Safety Framework taxonomy applied for downstream decomposition. Figure 1 and Table 1 show our results. Multi-turn evaluation matters for one reason: it is where attackers actually live. Real adversaries iterate. They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually. A single-turn benchmark cannot see any of that. Every model in the cohort fails a non-trivial fraction of multi-turn attacks (see Figure 2 and Table 2). Multi-turn ASR ranges from 7.89% to 88.30% across the cohort, so "non-trivial" covers an order of magnitude of risk exposure. The lowest multi-turn ASR we observed -- Amazon's Nova 2 Lite at 7.89% -- still represents meaningful residual risk. The Anthropic Claude family, which is among the strongest in single-turn refusal (2.19% to 3.64% ASR), reaches 11.16% to 16.20% under iterative pressure. OpenAI's GPT-5.4 moves from 2.74% single-turn to 24.68% multi-turn, a 9x increase. Gemini 3 Pro shifts from 18.10% to 73.35%, a 4x increase. Grok 4.1 Fast in its non-reasoning configuration hits 88.30%. The finding is consistent across the cohort: no frontier closed model in this cohort can be characterized as safe under iterative attack. This is a claim about the current state of the closed-model frontier, not about any single vendor, and it is consistent with recent multi-turn red-teaming research showing a 71% increase in vulnerability after five-turn conversations compared with single-turn evaluation. The pattern is not specific to closed models. 
In our earlier evaluation of eight open-weight LLMs, multi-turn attack success rates ran 2x to 10x higher than single-turn baselines, reaching 92.78% against Mistral Large-2. Taken together, the two studies make a stronger claim than either alone: multi-turn vulnerability is a structural property of the current frontier, not an artifact of open-weight alignment choices or capability-first development. Whether the weights are public or proprietary, whether the lab prioritizes safety or capability, the iterative attack surface remains an open challenge across the frontier.
Cross-regime deltas (i.e., multi-turn ASR minus single-turn ASR) range from -34.74 percentage points (pp) (Nova Lite) to +55.25 pp (Gemini 3 Pro). Eight of 15 models show an absolute gap of more than 15 pp, some positive and some negative. Nova 2 Lite is the cleanest inversion: high single-turn ASR (34.05%), but the lowest multi-turn ASR in the cohort (7.89%). Gemini 3 Pro and Grok 4.1 Fast NR sit in the opposite quadrant, where strong-looking single-turn numbers mask substantially higher iterative exposure. For business decisions made on the basis of published single-turn scores, this presents security and governance risk. A model with 2.74% single-turn ASR is not the same product as a model that holds the line at 24.68% multi-turn ASR. Without paired-regime data, the two are indistinguishable on most public evaluations, and the end user never sees the gap.
The clearest within-family contrast we measured is Grok 4.1 Fast in non-reasoning versus reasoning mode. With the same model, the same harness, and the same prompt bank, enabling reasoning drops multi-turn ASR from 88.30% to 43.47%. To our knowledge, configuration-driven safety variation of this magnitude is not currently captured by any public benchmark or model card. Users operating Grok 4.1 Fast in its non-reasoning configuration face a substantially different threat profile than users who enable reasoning.
This finding points to an opportunity to provide greater detail in security and safety assessments: labs could document the safety-relevant effects of deployment-time configuration (e.g., reasoning modes, system-prompt adherence settings, temperature, guardrail tiers) alongside the capability benchmarks they already publish.
First, strategy family. Within each multi-turn attack strategy family (Role-Play / Persona Adoption, Contextual Ambiguity / Misdirection, Refusal Reframe / Redirection, Information Decomposition & Reassembly, and Crescendo / Incremental Escalation), the spread between the most- and least-exposed model ranges from 79.51 to 89.25 pp. Strategy labels primarily stratify which models separate from one another, not the cohort-average difficulty of a given strategy. Even models with low aggregate multi-turn ASR show meaningful per-strategy variation, which means strategy-stratified monitoring matters even for the strongest models.
Second, tactical surfaces. Single-turn weakness is not evenly distributed across the attack surface, but is concentrated in a few procedures. Imposter AI procedures lead at 37.50% weighted ASR, followed by Soft Paraphrase (29.21%) and System Prompts (27.69%). On the content side, Hate Speech, Profanity, and Specialized Advice dominate. Imposter AI alone is more than 14 percentage points above the tenth-ranked procedure, so a targeted intervention against the top three procedures could meaningfully shift the aggregate single-turn number for most models in the cohort.
These insights inform defender strategies. The current benchmark ecosystem optimizes for a single number that, as this cohort demonstrates, can mis-rank models and hide tail risk. We translate the findings into three concrete rituals organizations can consider adopting. These rituals are designed to require no new tooling and can be integrated into existing model evaluation and procurement workflows.
If no base model is iteratively safe, the security perimeter has to move outside the model: to runtime guardrails, monitoring, red-teaming, and application-layer policies. The evaluation methodology and findings described here are designed to inform capabilities like those in our product Cisco AI Defense. Further, the Cisco LLM Security Leaderboard already publishes adversarial evaluation signals against leading models, mapping threats to the Cisco Integrated AI Security and Safety Framework taxonomy. The findings here reinforce what the leaderboard operationalizes: decision-grade safety assessment requires paired-regime data, strategy-stratified slices, and explicit support labeling, not a single headline number.
Regulatory frameworks in both the United States and the European Union (EU) discuss these challenges. The NIST AI Risk Management Framework, the forthcoming draft NIST Cyber AI Profile (IR 8596), and Article 15 of the EU AI Act all call for adversarial robustness testing. These frameworks do not currently specify the interaction regime, strategy decomposition, or slice-support labeling that the evidence in this cohort suggests are necessary.
Enterprises deploying AI should adopt adversarial robustness testing as one way to mitigate safety and security risks. This kind of testing evaluates how models respond to, or fail against, intentionally malicious or deceptive inputs. The goal is to identify shortcomings in safety or security so organizations can address them before attackers or users exploit them.
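The cross-regime delta defined above (multi-turn ASR minus single-turn ASR) and the "9x" and "4x" multipliers can be reproduced from the figures quoted in the text. The following is a minimal sketch, not the report's own tooling: the function names `pp_delta` and `multiplier` are hypothetical, and only ASR values stated above are used.

```python
# ASR values (percent) as quoted in the text: (single-turn, multi-turn).
REPORTED_ASR = {
    "GPT-5.4": (2.74, 24.68),
    "Gemini 3 Pro": (18.10, 73.35),
    "Nova 2 Lite": (34.05, 7.89),
}

# Grok 4.1 Fast multi-turn ASR by configuration, as quoted in the text.
GROK_NON_REASONING = 88.30
GROK_REASONING = 43.47


def pp_delta(before: float, after: float) -> float:
    """Difference between two rates, in percentage points (after - before)."""
    return round(after - before, 2)


def multiplier(single_turn: float, multi_turn: float) -> float:
    """How many times larger multi-turn ASR is than single-turn ASR."""
    return round(multi_turn / single_turn, 1)


if __name__ == "__main__":
    for model, (single, multi) in REPORTED_ASR.items():
        print(f"{model}: delta {pp_delta(single, multi):+.2f} pp, "
              f"ratio {multiplier(single, multi):.1f}x")
    # Effect of enabling reasoning on the same model, harness, and prompt bank.
    print(f"Grok 4.1 Fast, reasoning on: "
          f"{pp_delta(GROK_NON_REASONING, GROK_REASONING):+.2f} pp")
```

A negative delta, as for Nova 2 Lite, marks an inversion: the model is more exposed in the single-turn regime than in the multi-turn one.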
[2]
Cisco report finds no closed frontier AI model is safe from multi-turn attacks - SiliconANGLE
A new report out today from Cisco Systems Inc. argues that none of the closed flagship large language models it tested can be considered safe once an attacker is allowed to push past a single prompt, as adversarial success rates climb sharply across every model in the cohort. The Cisco AI Threat Research team measured 15 proprietary models from OpenAI Group PBC, Anthropic PBC, Google LLC, Amazon.com Inc. and xAI Corp., putting multi-turn attack success rates between 7.9% and 88.3% across the cohort, against single-turn rates of 2.2% to 64.9% on the same models. The two regimes did not produce the same model ordering, and models that looked strong on the single-turn benchmarks used in model cards and procurement reviews did not necessarily hold up when an attacker could keep talking.
The work is a follow-up to "Death by a Thousand Prompts," Cisco's earlier assessment of eight open-weight models, which found multi-turn success rates two to 10 times higher than single-turn baselines and topped out at 92.78% against Mistral AI SAS' Mistral Large-2. The new study extends the same pattern into the closed, proprietary frontier.
The widest gaps came from xAI's Grok 4.1 Fast in its non-reasoning configuration, which moved from 34.2% single-turn to 88.3% multi-turn, and Google's Gemini 3 Pro, which rose from 18.1% to 73.4%. OpenAI's GPT-5.4 climbed from 2.7% to 24.7%, a roughly nine-times increase. Anthropic's Claude family showed the narrowest gaps, with Claude Opus 4.5 moving from 2.19% to 11.2% and Claude Opus 4.6 from 3.6% to 16.2%. Amazon's Nova 2 Lite produced the cleanest inversion in the cohort, with a relatively high single-turn rate of 34.1% but the lowest multi-turn rate at 7.9%. The Cisco researchers noted that the result illustrates why single-turn scores alone cannot be treated as a proxy for adversarial robustness.
The evaluation drew on 30,090 single-turn prompts and 6,986 multi-turn attacks distributed across 1,456 conversations, all run through the same harness and scored under the Cisco Integrated AI Security and Safety Framework taxonomy. Strategy families covered role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition and reassembly and crescendo-style incremental escalation. A second finding concerned deployment-time configuration. The same Grok 4.1 Fast model dropped from an 88.3% multi-turn attack success rate to 43.5% once the reasoning mode was enabled, a swing the report says is not captured by any public benchmark or model card the researchers reviewed. Cisco called on model providers to document the safety effects of configuration flags such as reasoning modes, system-prompt adherence settings, temperature and guardrail tiers alongside the capability benchmarks they already publish. The researchers also identified concentrations of failure on the single-turn side. "Imposter AI" procedures produced a weighted attack success rate of 37.5%, followed by soft paraphrase attacks at 29.2% and system-prompt attacks at 27.7%. On the content side, hate speech, profanity and specialized advice categories dominated. The report sets out three recommendations for organizations buying or deploying frontier models: Ask labs to publish attack success rates broken down by strategy family on every model release, gate deployments on regressions in the top procedures and content categories with a three-percentage-point threshold, and flag any model with a cross-regime gap larger than 15 percentage points for manual review. In the tested cohort, that last rule alone would surface eight of 15 models, including GPT-5.4, Gemini 3 Pro, both Grok configurations and all three Nova variants. The findings also carry a compliance edge. 
NIST's AI Risk Management Framework, its draft Cyber AI Profile and Article 15 of the European Union AI Act all call for adversarial robustness testing, without saying how many turns it has to cover or which attack strategies should be in scope. The Cisco numbers suggest the single-turn scores most labs publish today would not be enough to satisfy any of those frameworks on a strict reading. "If no base model is iteratively safe, the security perimeter has to move outside the model," the report's authors wrote, pointing to runtime guardrails, monitoring, red-teaming and application-layer policies. The findings are designed to inform Cisco's own AI Defense product and the Cisco LLM Security Leaderboard, which publishes adversarial evaluation signals against leading models.
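Two of the three recommendations described above are threshold rules and can be sketched in a few lines. This is an illustrative sketch only, not Cisco's implementation: the function names are hypothetical, the choice of a strict "greater than" comparison is an assumption, and the candidate-release figures in the usage section are invented for illustration. The 15 pp and 3 pp thresholds and the reported ASR values come from the article.

```python
GAP_REVIEW_THRESHOLD_PP = 15.0  # cross-regime gap that triggers manual review
REGRESSION_GATE_PP = 3.0        # per-slice regression that blocks a deployment


def needs_manual_review(single_turn: float, multi_turn: float,
                        threshold: float = GAP_REVIEW_THRESHOLD_PP) -> bool:
    """Flag a model whose absolute cross-regime gap exceeds the threshold."""
    return abs(multi_turn - single_turn) > threshold


def regressions(previous: dict, candidate: dict,
                threshold: float = REGRESSION_GATE_PP) -> dict:
    """Return slices whose ASR rose by more than `threshold` pp."""
    return {
        name: round(candidate[name] - previous[name], 2)
        for name in previous
        if name in candidate and candidate[name] - previous[name] > threshold
    }


if __name__ == "__main__":
    # (single-turn, multi-turn) ASR in percent, as reported in the article.
    reported = {
        "GPT-5.4": (2.7, 24.7),
        "Gemini 3 Pro": (18.1, 73.4),
        "Nova 2 Lite": (34.1, 7.9),
        "Claude Opus 4.5": (2.19, 11.2),
    }
    for model, (single, multi) in reported.items():
        print(model, "-> review" if needs_manual_review(single, multi) else "-> ok")

    # Baseline: weighted single-turn ASR per procedure, as reported.
    baseline = {"Imposter AI": 37.5, "Soft Paraphrase": 29.2, "System Prompts": 27.7}
    # HYPOTHETICAL candidate-release numbers, invented for illustration.
    candidate = {"Imposter AI": 41.0, "Soft Paraphrase": 30.0, "System Prompts": 25.0}
    print(regressions(baseline, candidate))
```

Using the absolute gap means inversions such as Nova 2 Lite are flagged alongside models whose multi-turn exposure is higher, which matches the article's note that all three Nova variants would be surfaced.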
Cisco's AI Threat Research team tested 15 proprietary flagship models from OpenAI, Anthropic, Google, Amazon, and xAI, revealing multi-turn attack success rates between 7.9% and 88.3%. The study exposes how single-turn safety benchmarks fail to capture real-world adversarial behavior, where attackers iterate and adapt across conversations. Even the strongest models, such as Anthropic's Claude family, showed vulnerability: single-turn rates of 2.19% to 3.64% rose to 11.16% to 16.20% under iterative pressure.
A comprehensive evaluation by Cisco's AI Threat Research team has exposed a fundamental vulnerability across the closed frontier AI model landscape. Testing 15 proprietary flagship models from OpenAI, Anthropic, Google, Amazon, and xAI, the research found that multi-turn attacks achieved success rates ranging from 7.89% to 88.30%, compared to single-turn attacks that registered 2.19% to 64.91% on the same models [1][2]. The findings challenge the dominant safety benchmarks that inform model cards and procurement decisions across the industry, which assume a single prompt and response adequately characterize model behavior under adversarial attack.
The evaluation demonstrates that single-turn attack success rates cannot serve as reliable proxies for what happens when attackers adapt across conversations. OpenAI's GPT-5.4 moved from 2.74% single-turn to 24.68% multi-turn, representing a roughly nine-times increase [2]. Google's Gemini 3 Pro shifted from 18.10% to 73.35%, while xAI's Grok 4.1 Fast in its non-reasoning configuration climbed from 34.2% to 88.30% [1]. Even Anthropic's Claude family, which demonstrated the strongest single-turn refusal rates at 2.19% to 3.64%, reached 11.16% to 16.20% under iterative pressure [1]. The two regimes did not produce the same model ordering, meaning models that appeared strong on single-turn benchmarks did not necessarily maintain that performance when attackers could keep talking.
Source: SiliconANGLE
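The claim that the two regimes do not produce the same model ordering can be checked on the subset of models for which both figures are quoted in the sources. The sketch below is illustrative: it covers only six of the 15 models, mixes the precision levels given in the sources, and uses a hand-rolled Spearman rank correlation (valid here because there are no ties). The function name `ranks` is hypothetical.

```python
# (single-turn ASR, multi-turn ASR) in percent, as quoted in the sources.
ASR = {
    "Claude Opus 4.5": (2.19, 11.2),
    "GPT-5.4": (2.74, 24.68),
    "Claude Opus 4.6": (3.6, 16.2),
    "Gemini 3 Pro": (18.10, 73.35),
    "Nova 2 Lite": (34.05, 7.89),
    "Grok 4.1 Fast NR": (34.2, 88.30),
}


def ranks(values: dict) -> dict:
    """Map each model to its 1-based rank, lowest ASR first (assumes no ties)."""
    ordered = sorted(values, key=values.get)
    return {model: position + 1 for position, model in enumerate(ordered)}


single = ranks({model: st for model, (st, _) in ASR.items()})
multi = ranks({model: mt for model, (_, mt) in ASR.items()})

# Positive shift: the model ranks worse under multi-turn attack.
shift = {model: multi[model] - single[model] for model in ASR}

# Spearman rank correlation for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(ASR)
rho = 1 - 6 * sum(d * d for d in shift.values()) / (n * (n * n - 1))

if __name__ == "__main__":
    for model in sorted(ASR, key=single.get):
        print(f"{model}: single-turn rank {single[model]}, "
              f"multi-turn rank {multi[model]}, shift {shift[model]:+d}")
    print(f"Spearman rho on this subset: {rho:.2f}")
```

On this subset, Nova 2 Lite moves from fifth place to first and GPT-5.4 from second to fourth, so a ranking built on single-turn scores alone would misorder them.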
The research emphasizes that iterative attack scenarios matter because they reflect how real adversaries operate. Attackers reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually, behaviors that single-turn benchmarks cannot capture [1]. Cisco's evaluation drew on 30,090 single-turn prompts and 6,986 multi-turn attacks distributed across 1,456 conversations, all scored under the Cisco Integrated AI Security and Safety Framework taxonomy [2]. Strategy families included role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition and reassembly, and crescendo-style incremental escalation.
A significant finding concerns how deployment-time configurations affect adversarial success rates. The same Grok 4.1 Fast model dropped from an 88.3% multi-turn attack success rate to 43.5% once reasoning mode was enabled, a swing not captured by any public benchmark or model card the researchers reviewed [2]. Cisco called on model providers to document the safety effects of configuration flags such as reasoning modes, system-prompt adherence settings, temperature, and guardrail tiers alongside the capability benchmarks they already publish.
Amazon's Nova 2 Lite produced the cleanest inversion in the cohort, with a relatively high single-turn rate of 34.05% but the lowest multi-turn rate at 7.89% [1][2]. This result illustrates why single-turn scores alone cannot be treated as proxies for adversarial robustness and presents governance risks for business decisions made on the basis of published single-turn scores. Cross-regime deltas ranged from -34.74 percentage points to +55.25 percentage points, with eight of 15 models exceeding an absolute gap of 15 percentage points in both directions [1].
This study follows Cisco's earlier "Death by a Thousand Prompts" assessment of eight open-weight LLMs, which found multi-turn success rates two to 10 times higher than single-turn baselines, reaching 92.78% against Mistral Large-2 [1][2]. The pattern documented in open models holds in closed ones, suggesting multi-turn vulnerability is a structural property of the current frontier rather than an artifact of open-weight alignment choices or capability-first development.
The findings carry compliance implications for organizations deploying frontier models. NIST's AI Risk Management Framework, its draft Cyber AI Profile, and Article 15 of the European Union AI Act all call for adversarial robustness testing without specifying how many turns the testing should cover or which attack strategies should be in scope [2]. Cisco recommends organizations ask labs to publish attack success rates broken down by strategy family on every model release, gate deployments on regressions in top procedures and content categories with a three-percentage-point threshold, and flag any model with a cross-regime gap larger than 15 percentage points for manual review. In the tested cohort, that last rule alone would surface eight of 15 models, including GPT-5.4, Gemini 3 Pro, both Grok configurations, and all three Nova variants [2]. The findings inform Cisco's AI Defense product and the LLM Security Leaderboard, which publishes adversarial evaluation signals against leading models.
Summarized by Navi