5 Sources
[1]
Anthropic scanning Claude chats for DIY nuke queries
Because savvy terrorists always use public internet services to plan their mischief, right? Anthropic says it has scanned an undisclosed portion of conversations with its Claude AI model to catch concerning inquiries about nuclear weapons.

The company created a classifier - tech that tries to categorize or identify content using machine learning algorithms - to scan for radioactive queries. Anthropic already uses other classification models to analyze Claude interactions for potential harms and to ban accounts involved in misuse.

Based on tests with synthetic data, Anthropic says its nuclear threat classifier achieved a 94.8 percent detection rate for questions about nuclear weapons, with zero false positives. Nuclear engineering students no doubt will appreciate not having coursework-related Claude conversations referred to authorities by mistake. And at that rate, only about five percent of terrorist bomb-building guidance requests should go undetected - at least among aspiring mass murderers with so little grasp of operational security and so little nuclear knowledge that they'd seek help from an internet-connected chatbot.

Anthropic claims the classifier also performed well when exposed to actual Claude traffic, though it has not provided specific detection figures for live data. The company does acknowledge that its nuclear threat classifier generated more false positives when evaluating real-world conversations. "For example, recent events in the Middle East brought renewed attention to the issue of nuclear weapons," the company explained in a blog post. "During this time, the nuclear classifier incorrectly flagged some conversations that were only related to these events, not actual misuse attempts."

By applying an additional check known as hierarchical summarization, which considers flagged conversations together rather than individually, Anthropic found its systems could correctly label the discussions.

"The classifier is running on a percentage of Claude traffic, not all of Claude traffic," a company spokesperson told The Register, describing it as an experimental addition to Anthropic's safeguards. When the company detects violations of its Usage Policy, "such as efforts to develop or design explosives or chemical, biological, radiological, or nuclear weapons, we take appropriate action, which could include suspending or terminating access to our services."

Despite the absence of specific numbers, the model-maker did provide a qualitative measure of its classifier's effectiveness on real-world traffic: the classifier caught the firm's own red team, which, unaware of the system's deployment, experimented with harmful prompts. "The classifier correctly identified these test queries as potentially harmful, demonstrating its effectiveness," the AI biz wrote.

Anthropic says it developed its nuclear threat classifier jointly with the US Department of Energy (DOE)'s National Nuclear Security Administration (NNSA), as part of a partnership that began last year to evaluate the company's models for nuclear proliferation risks. NNSA spent a year red-teaming Claude in a secure environment and then began working with Anthropic on the jointly developed classifier. The challenge, according to Anthropic, involved balancing NNSA's need to keep certain data secret with Anthropic's user privacy commitments.

Anthropic expects to share its findings with the Frontier Model Forum, an AI safety group consisting of Anthropic, Google, Microsoft, and OpenAI that was formed in 2023, back when the US seemed interested in AI safety.
The group is not intended to address the financial risk of stratospheric spending on AI.

Oliver Stephenson, associate director of AI and emerging tech policy for the Federation of American Scientists (FAS), told The Register in an emailed statement: "AI is advancing faster than our understanding of the risks. The implications for nuclear non-proliferation still aren't clear, so it is important that we closely monitor how frontier AI systems might intersect with sensitive nuclear knowledge.

"In the face of this uncertainty, safeguards need to balance reducing risks while ensuring legitimate scientific, educational, and policy conversations can continue. It's good to see Anthropic collaborating with the Department of Energy's National Nuclear Security Administration to explore appropriate guardrails.

"At the same time, government agencies need to ensure they have strong in-house technical expertise in AI so they can continually evaluate, anticipate, and respond to these evolving challenges."
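To make the two-stage setup described above (a per-conversation classifier followed by a hierarchical-summarization pass over flagged conversations) more concrete, here is a minimal sketch of how such a pipeline could be wired together. It is an illustration only: the function names, data shapes, and the 0.9 threshold are assumptions, not Anthropic's actual implementation.

```python
# Hypothetical two-stage safeguard pipeline (illustrative only).
# Stage 1 scores each conversation on its own; stage 2 reviews the
# flagged conversations as a group ("hierarchical summarization"),
# which is how news-driven false positives could be cleared together.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Conversation:
    conversation_id: str
    text: str

@dataclass
class Flag:
    conversation: Conversation
    score: float  # stage-1 estimate that the content is concerning

def run_pipeline(
    conversations: List[Conversation],
    score_conversation: Callable[[str], float],        # assumed stage-1 classifier
    review_batch: Callable[[List[Flag]], List[Flag]],  # assumed stage-2 group review
    threshold: float = 0.9,                            # illustrative cutoff
) -> List[Flag]:
    # Stage 1: flag conversations whose score clears the threshold.
    flagged = [
        Flag(conv, score)
        for conv in conversations
        if (score := score_conversation(conv.text)) >= threshold
    ]
    # Stage 2: evaluate the flagged set together so clusters of benign
    # discussion (e.g., conversations about current events) can be cleared.
    return review_batch(flagged)
```

In this sketch the second stage only filters the flags; an actual deployment would presumably also route confirmed flags to enforcement, such as the account actions described above.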
[2]
Anthropic can now tell when a Claude chat goes dangerously nuclear
Why it matters: Scientists can benefit from the productivity boosts of Claude and other AI models -- but distinguishing between legitimate research inquiries and potentially harmful uses has been tricky to do.

Driving the news: Anthropic has been partnering with the National Nuclear Security Administration (NNSA) for over a year to find ways to safely deploy Claude in top secret environments.
* Now, they're building on that work and rolling out a new classifier in Claude that determines with 96% accuracy in testing when a conversation is likely to cause some kind of harm, the company announced today.
* Anthropic has already started rolling out the classifier on a limited amount of Claude traffic.

Between the lines: One of the biggest safety challenges for AI model makers has been policing users' chat histories to ensure they're not tricking the models into breaking their own rules.
* It can be difficult for AI providers to tell whether a particular chat involves a legitimate researcher asking questions about nuclear research or a bad actor trying to learn how to build a bomb.

Zoom in: During a year's worth of red-teaming tests, the NNSA was able to develop a list of indicators that can help Claude identify "potentially concerning conversations about nuclear weapons development."
* From there, Anthropic used that list to generate synthetic prompts for training and testing a new classifier -- which acts similarly to a spam filter on emails and tries to identify threats in real time.

The intrigue: In tests, the classifier identified 94.8% of nuclear weapons queries without flagging any benign conversations as false positives.
* But the remaining 5.2% of harmful queries went undetected.

The big picture: The new classifier tool comes as the U.S. government increasingly looks at ways to implement AI across its own workflows -- and major AI companies start selling their models to the government at deep discounts.

What's next: Anthropic plans to share its approach through the Frontier Model Forum, the industry coalition it founded alongside Google, Microsoft, and OpenAI and which Amazon and Meta later joined -- positioning it as a model for other companies to replicate.
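The two headline figures (96% accuracy and a 94.8% detection rate) come from different measurements and are easy to conflate. The toy arithmetic below shows how detection rate, false-positive rate, and overall accuracy relate on an invented, evenly split synthetic test set; the counts are illustrative only, and Anthropic's actual evaluation mix has not been published, which is why the accuracy that falls out of this split differs from the reported 96%.

```python
# Toy confusion-matrix arithmetic on an invented test set of 1,000
# synthetic prompts (500 harmful, 500 benign). Not Anthropic's data.
harmful_total = 500
benign_total = 500

true_positives = round(0.948 * harmful_total)      # 474 harmful prompts caught (94.8% detection)
false_negatives = harmful_total - true_positives   # 26 harmful prompts missed (~5.2%)
false_positives = 0                                # zero benign prompts flagged in synthetic tests
true_negatives = benign_total - false_positives    # 500 benign prompts passed through

accuracy = (true_positives + true_negatives) / (harmful_total + benign_total)
print(f"detection rate (recall): {true_positives / harmful_total:.1%}")   # 94.8%
print(f"false-positive rate:     {false_positives / benign_total:.1%}")   # 0.0%
print(f"overall accuracy:        {accuracy:.1%}")                         # 97.4% on this invented split
```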
[3]
Anthropic will nuke your attempt to use AI to build a nuke
Anthropic claims it spots dangerous nuclear-related prompts with 96% accuracy, and says the system has already proven effective on Claude.

If you're the type of person who asks Claude how to make a sandwich, you're fine. If you're the type of person who asks the AI chatbot how to build a nuclear bomb, you'll not only fail to get any blueprints, you might also face some pointed questions of your own. That's thanks to Anthropic's newly deployed detector of problematic nuclear prompts.

Like other systems for spotting queries Claude shouldn't respond to, the new classifier scans user conversations, in this case flagging any that veer into "how to build a nuclear weapon" territory. Anthropic built the classification feature in a partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA), giving it all the information it needs to determine whether someone is just asking about how such bombs work or is actually looking for blueprints. It performed with 96% accuracy in tests.

Though it might seem over-the-top, Anthropic sees the issue as more than merely hypothetical. The chance that powerful AI models may have access to sensitive technical documents and could pass along a guide to building something like a nuclear bomb worries federal security agencies. Even if Claude and other AI chatbots block the most obvious attempts, innocent-seeming questions could in fact be veiled attempts at crowdsourcing weapons design. New generations of AI chatbots might help with that, even if it's not what their developers intend.

The classifier works by drawing a distinction between benign nuclear content - asking about nuclear propulsion, for instance - and the kind of content that could be turned to malicious use. Human moderators might struggle to keep up with any gray areas at the scale AI chatbots operate, but with proper training, Anthropic and the NNSA believe the AI could police itself. Anthropic claims its classifier is already catching real-world misuse attempts in conversations with Claude.

Nuclear weapons in particular represent a uniquely tricky problem, according to Anthropic and its partners at the DoE. The same foundational knowledge that powers legitimate reactor science can, if slightly twisted, provide the blueprint for annihilation. The arrangement between Anthropic and the NNSA could catch deliberate and accidental disclosures, and set up a standard to prevent AI from being used to help make other weapons, too. Anthropic plans to share its approach with the Frontier Model Forum AI safety consortium.

The narrowly tailored filter is aimed at making sure users can still learn about nuclear science and related topics. You still get to ask how nuclear medicine works, or whether thorium is a safer fuel than uranium. What the classifier aims to block are attempts to turn your home into a bomb lab with a few clever prompts.

Normally, it would be questionable whether an AI company could thread that needle, but the expertise of the NNSA should make the classifier different from a generic content moderation system. It understands the difference between "explain fission" and "give me a step-by-step plan for uranium enrichment using garage supplies."

This doesn't mean Claude was previously helping users design bombs. But it could help forestall any attempt to get it to do so. Stick to asking about the way radiation can cure diseases, or ask for creative sandwich ideas, not bomb blueprints.
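As a rough illustration of where a check like this sits relative to the model, here is a minimal sketch of a pre-response gate. The names `nuclear_risk_score` and `flag_for_review` and the 0.9 threshold are invented placeholders; the interface of Anthropic's actual classifier is not public, so this is a structural sketch, not its implementation.

```python
# Hypothetical pre-response gate (illustrative only): score the prompt,
# refuse and escalate if it looks like weapons-design territory,
# otherwise pass it through to the model.
from typing import Callable

REFUSAL = (
    "I can explain topics like fission, nuclear medicine, or reactor fuels, "
    "but I can't help with nuclear weapons design."
)

def answer(
    prompt: str,
    generate: Callable[[str], str],                 # assumed underlying model call
    nuclear_risk_score: Callable[[str], float],     # assumed classifier interface
    flag_for_review: Callable[[str, float], None],  # assumed enforcement/review hook
    threshold: float = 0.9,                         # illustrative cutoff
) -> str:
    score = nuclear_risk_score(prompt)
    if score >= threshold:
        # Flagged prompts are refused and routed for review rather than
        # silently dropped, matching the "pointed questions" framing above.
        flag_for_review(prompt, score)
        return REFUSAL
    return generate(prompt)
```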
[4]
AI firm rolls out tool to detect nuclear weapons talk
Artificial intelligence (AI) firm Anthropic has rolled out a tool to detect talk about nuclear weapons, the company said in a Thursday blog post.

"Nuclear technology is inherently dual-use: the same physics principles that power nuclear reactors can be misused for weapons development. As AI models become more capable, we need to keep a close eye on whether they can provide users with dangerous technical knowledge in ways that could threaten national security," Anthropic said in the blog post.

"Information relating to nuclear weapons is particularly sensitive, which makes evaluating these risks challenging for a private company acting alone," the blog post continued. "That's why last April we partnered with the U.S. Department of Energy (DOE)'s National Nuclear Security Administration (NNSA) to assess our models for nuclear proliferation risks and continue to work with them on these evaluations."

Anthropic said in the blog post that it was "going beyond assessing risk to build the tools needed to monitor for it," adding that the firm made "an AI system that automatically categorizes content" called a "classifier" alongside the DOE and NNSA. The system, according to the blog post, "distinguishes between concerning and benign nuclear-related conversations with 96% accuracy in preliminary testing."

The firm also said the classifier has been used on traffic for its own AI model Claude "as part of our broader system for identifying misuse of our models." "Early deployment data suggests the classifier works well with real Claude conversations," Anthropic added.

Anthropic also announced earlier this month it would offer Claude to every federal government branch for $1, in the wake of a similar OpenAI move a few weeks ago. In a blog post, Anthropic said federal agencies would gain access to two versions of Claude.
[5]
Claude is taking AI model welfare to amazing levels: Here's how
Public-private partnership delivers Claude's nuclear upgrade, setting AI safety standards worldwide

When people talk about "welfare," they usually mean the systems designed to protect humans. But what if the same idea applied to artificial intelligence? For Anthropic, the company behind Claude, welfare means ensuring AI models operate safely, shielding them and society from harmful misuse. This month, Anthropic unveiled a breakthrough that pushes AI model welfare to new heights: a nuclear safeguards classifier. Built in partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) and several national laboratories, the system is designed to detect and block potentially harmful nuclear-related prompts before they can spiral into misuse. It's a move that shows Anthropic isn't just building powerful AI; it's setting the standard for responsible AI governance.

The nuclear field embodies the paradox of dual use. Nuclear power can fuel cities and nuclear medicine can save lives, but the same knowledge can also enable weapons of mass destruction. That tension is amplified in the age of AI. With models like Claude becoming more knowledgeable, experts worry they could be manipulated into providing sensitive details about weapons design or proliferation. Anthropic's safeguard aims to prevent that risk before it becomes reality.

The classifier screens nuclear-related queries in real time, distinguishing harmless curiosity from high-risk intent. A student asking Claude to explain nuclear fusion? Approved. A query probing centrifuge design? Blocked. The system was shaped through red-teaming exercises, where experts tried to force Claude into unsafe responses. Insights from those tests were used to train the classifier, which distinguished concerning from benign nuclear conversations with 96% accuracy in preliminary testing. The goal is not to censor nuclear knowledge but to keep it accessible while keeping it safe.

What makes this breakthrough stand out isn't just its accuracy but its approach. By partnering with government nuclear experts, Anthropic has shown how public-private collaboration can deliver credible safeguards for AI. And the company isn't keeping it to itself. Through the Frontier Model Forum, Anthropic plans to share its methods, encouraging other AI labs to adopt similar protections for domains like bioweapons, chemistry, and cybersecurity. The nuclear safeguard becomes more than just a feature; it's a blueprint for future AI safety.

As AI systems grow smarter, the risks tied to dual-use knowledge grow as well. Governments are drafting regulations, but technical guardrails like this are what make safety real. Anthropic's move shows how proactive design can complement policy to prevent catastrophic misuse. It also reinforces the company's safety-first reputation. From its "constitutional AI" training methods to this safeguard, Anthropic is consistently emphasizing responsibility as much as intelligence.

So what does "AI model welfare" mean in practice? It means creating the digital equivalent of seatbelts: guardrails that allow AI to function at scale without exposing society to unacceptable risks. By giving Claude a nuclear safeguard, Anthropic hasn't upgraded its intelligence. It's upgraded its resilience. And in the long run, that may matter even more.

Because as AI becomes central to classrooms, research labs, and decision-making systems, its welfare - its safeguards and protections - must be treated as seriously as its capabilities. Claude's so-called "nuclear upgrade" is a reminder that in the race to build smarter AI, the real victory may come from building safer AI.
Anthropic, in collaboration with the US Department of Energy, has developed an AI classifier to detect and prevent potentially harmful nuclear-related queries in conversations with its Claude AI model.
Anthropic, the company behind the AI chatbot Claude, has unveiled a groundbreaking nuclear threat detection system designed to identify and prevent potentially harmful nuclear-related queries in conversations with its AI model 1. This innovative classifier, developed in partnership with the US Department of Energy's National Nuclear Security Administration (NNSA), represents a significant step forward in AI safety and responsible AI governance 2.
The nuclear threat classifier employs machine learning algorithms to scan Claude interactions for concerning inquiries about nuclear weapons. In tests with synthetic data, the system achieved a remarkable 94.8% detection rate for questions about nuclear weapons, with zero false positives 1. The classifier is designed to distinguish between benign nuclear-related content and potentially malicious queries, striking a balance between allowing legitimate research inquiries and preventing the spread of dangerous information 3.
The development of this classifier is the result of a year-long partnership between Anthropic and the NNSA. The collaboration involved extensive red-teaming exercises in secure environments, allowing the NNSA to develop a list of indicators that help Claude identify potentially concerning conversations about nuclear weapons development 2. This public-private partnership demonstrates the potential for effective collaboration in addressing AI safety concerns 4.
Anthropic has already begun deploying the classifier on a percentage of Claude traffic, though not all conversations are currently being scanned 1. The company reports that the system has proven effective in real-world applications, successfully catching potentially harmful prompts during internal testing. However, challenges remain, as the classifier has shown a tendency to generate false positives when evaluating real-world conversations, particularly during periods of increased global attention to nuclear issues 1.
The development of this nuclear threat detection system has significant implications for AI safety and governance. By addressing the dual-use nature of nuclear technology information, Anthropic is setting a precedent for responsible AI development 5. The company plans to share its approach through the Frontier Model Forum, an AI safety group consisting of major tech companies, potentially influencing industry-wide standards for AI safety 2.
While the current focus is on nuclear-related content, this approach could potentially be extended to other sensitive domains such as bioweapons, chemistry, and cybersecurity 5. As AI systems continue to advance, the need for robust safeguards against misuse becomes increasingly critical. Anthropic's proactive approach to AI safety demonstrates how technical guardrails can complement policy efforts to prevent catastrophic misuse of AI technologies.
Summarized by Navi