12 Sources
[1]
Research leaders urge tech industry to monitor AI's 'thoughts' | TechCrunch
AI researchers from OpenAI, Google DeepMind, Anthropic, as well as a broad coalition of companies and nonprofit groups, are calling for deeper investigation into techniques for monitoring the so-called thoughts of AI reasoning models in a position paper published Tuesday. A key feature of AI reasoning models, such as OpenAI's o3 and DeepSeek's R1, are their chains-of-thought or CoTs -- an externalized process in which AI models work through problems, similar to how humans use a scratch pad to work through a difficult math question. Reasoning models are a core technology for powering AI agents, and the paper's authors argue that CoT monitoring could be a core method to keep AI agents under control as they become more widespread and capable. "CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions," said the researchers in the position paper. "Yet, there is no guarantee that the current degree of visibility will persist. We encourage the research community and frontier AI developers to make the best use of CoT monitorability and study how it can be preserved." The position paper asks leading AI model developers to study what makes CoTs "monitorable" -- in other words, what factors can increase or decrease transparency into how AI models really arrive at answers. The paper's authors say that CoT monitoring may be a key method for understanding AI reasoning models, but note that it could be fragile, cautioning against any interventions that could reduce their transparency or reliability. The paper's authors also call on AI model developers to track CoT monitorability and study how the method could one day be implemented as a safety measure. Notable signatories of the paper include OpenAI chief research officer Mark Chen, Safe Superintelligence CEO Ilya Sutskever, Nobel laureate Geoffrey Hinton, Google DeepMind cofounder Shane Legg, xAI safety adviser Dan Hendrycks, and Thinking Machines co-founder John Schulman. Other signatories come from organizations including the UK AI Security Institute, METR, Apollo Research, and UC Berkeley. The paper marks a moment of unity among many of the AI industry's leaders in an attempt to boost research around AI safety. It comes at a time when tech companies are caught in a fierce competition -- which has led Meta to poach top researchers from OpenAI, Google DeepMind, and Anthropic with million-dollar offers. Some of the most highly sought-after researchers are those building AI agents and AI reasoning models. "We're at this critical time where we have this new chain-of-thought thing. It seems pretty useful, but it could go away in a few years if people don't really concentrate on it," said Bowen Baker, an OpenAI researcher who worked on the paper, in an interview with TechCrunch. "Publishing a position paper like this, to me, is a mechanism to get more research and attention on this topic before that happens." OpenAI publicly released a preview of the first AI reasoning model, o1, in September 2024. In the months since, the tech industry was quick to release competitors that exhibit similar capabilities, with some models from Google DeepMind, xAI, and Anthropic showing even more advanced performance on benchmarks. However, there's relatively little understood about how AI reasoning models work. While AI labs have excelled at improving the performance of AI in the last year, that hasn't necessarily translated into a better understanding of how they arrive at their answers. Anthropic has been one of the industry's leaders in figuring out how AI models really work -- a field called interpretability. Earlier this year, CEO Dario Amodei announced a commitment to crack open the black box of AI models by 2027 and invest more in interpretability. He called on OpenAI and Google DeepMind to research the topic more, as well. Early research from Anthropic has indicated that CoTs may not be a fully reliable indication of how these models arrive at answers. At the same time, OpenAI researchers have said that CoT monitoring could one day be a reliable way to track alignment and safety in AI models. The goal of position papers like this is to signal boost and attract more attention to nascent areas of research, such as CoT monitoring. Companies like OpenAI, Google DeepMind, and Anthropic are already researching these topics, but it's possible that this paper will encourage more funding and research into the space.
[2]
Researchers from OpenAI, Anthropic, Meta, and Google issue joint AI safety warning - here's why
Over the last year, chain of thought (CoT) -- an AI model's ability to articulate its approach to a query in natural language -- has become an impressive development in generative AI, especially in agentic systems. Now, several researchers agree it may also be critical to AI safety efforts. On Tuesday, researchers from competing companies including OpenAI, Anthropic, Meta, and Google DeepMind, as well as institutions like the Center for AI Safety, Apollo Research, and the UK AI Security Institute, came together in a new position paper titled "Chain of Thought Monitorability: A New and Fragile Opportunity for AI." The paper details how observing CoT could reveal key insights about a model's ability to misbehave -- and warns that training models to become more advanced could cut off those insights. (Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.) Also: AI agents will threaten humans to achieve their goals, Anthropic report finds A model uses chain of thought to explain the steps it's taking to tackle a problem, sometimes speaking its internal monologue as if no one is listening. This gives researchers a peek into its decision-making (and sometimes even its morals). Because models reveal their rumination process through CoT, they can also expose motivations or actions that safety researchers want to quell, or at least know the LLM is capable of. Models lie By now, much research has established that models deceive, either to protect their original directives, please users, preserve themselves from being retrained, or, ironically, avoid committing harm. In December, Apollo published research testing six frontier models to determine which lied the most (it was OpenAI's o1). Researchers even developed a new benchmark to detect how much a model is lying. Also: OpenAI used to test its AI models for months - now it's days. Why that matters As AI agents get better at autonomous tasks -- and better at deceiving -- they've become equally opaque, obscuring the potential risks of their capabilities. Those risks are much easier to control if developers can interpret how an AI system is making decisions. That's where CoT comes in. Models often reveal bad intentions through CoT anyway -- though, as research from Anthropic showed in April, models still hide some parts of their reasoning with CoT. Rather than just evaluating a model's actions, the researchers propose that safety teams create dedicated monitors that read a model's CoT, too, to catch suspicious or harmful activity. Developers can then block, review, or replace flagged responses, gaining better insights into agents' thoughts and goals. Future training could harm visibility In a March 2025 paper, OpenAI noted that "CoT monitoring may be one of few tools we will have to oversee superhuman models of the future." The company also found that penalizing models for displaying "bad thoughts" in CoT didn't eliminate those thoughts -- it just made models hide them better. This latest paper expands on that, warning that further training could cut off this source of info. Because CoT is presented in natural language, humans can understand it; if models evolve too far from that language, we lose those insights. "Earlier research suggests that language models fine-tuned with outcome-based rewards can drift away from legible English," the authors note. Also: 96% of IT pros say AI agents are a security risk, but they're deploying them anyway As developers advance the architectures models run on, AI systems could expand so continuously that they become nonverbal -- sort of like they're operating on a plane higher than language. The authors add that research into process supervision, or the act of optimizing a model's reasoning, is steadily advancing, and could result in models tailoring their CoT responses to "less faithfully reflect" how they arrived at their conclusions. That could obscure harms (but, on the plus side, it could also improve the safety of the reasoning process itself). Also: 100 leading AI scientists map route to more 'trustworthy, reliable, secure' AI Overall, the paper's advice somewhat pits model progress against an important layer of safety. While a single layer isn't the be-all-end-all of security -- robust approaches employ a checks-and-balances system of many -- the researchers argue we need all the layers we can get. Caveats Monitoring CoT isn't a perfect solution, though; future models could adapt to the surveillance. "A model might become aware its reasoning traces are being monitored, or an auxiliary training objective [...] might cause a model to shortcut important reasoning steps," which can hide alarming behavior, the paper explains. Also: Does your generative AI protect your privacy? New study ranks them best to worst Interestingly, CoT may also empower future risks. AI models need a much higher level of reasoning to successfully carry out complex actions like cyberattacks or preserve themselves in the face of being dismantled or retrained. That requires what researchers call working memory: a place to store and iterate upon information, which chain of thought effectively functions as. This means CoT is a sort of double-edged superpower: It both provides a window into how models work, which could expose bad intentions, and gives them the tool they need to carry out bigger, more complex, and riskier tasks. But wait, there's more: Researchers still don't know whether models will always need working memory to carry out the most dangerous risks. "Not all dangerous actions will require reasoning to execute, especially as AI systems begin to be routinely trusted with more and more high-stakes tasks," the authors concede. That means CoT monitoring might not catch every warning sign, and that risks increase the more autonomous agents become. Also: 5 quick ways to tweak your AI use for better results - and a safer experience While researchers admit that a monitor isn't a total failsafe, it's still a valuable safety approach for avoiding rogue AI systems. How preserving that impacts model development is still to be seen. Want more stories about AI? Sign up for Innovation, our weekly newsletter.
[3]
OpenAI, Google, and Meta Researchers Warn We May Lose the Ability to Track AI Misbehavior
The more these models 'think,' the harder to understand they become. Over 40 scientists from the world's leading AI institutions, including OpenAI, Google DeepMind, Anthropic, and Meta, have come together to call for more research in a particular type of safety monitoring that allows humans to analyze how AI models “think.â€Â The scientists published a research paper on Tuesday that highlighted what is known as chain of thought (CoT) monitoring as a new yet fragile opportunity to boost AI safety. The paper was endorsed by prominent AI figures like OpenAI co-founders John Schulman and Ilya Sutskever as well as Nobel Prize laureate known as the "Godfather of AI," Geoffrey Hinton. In the paper, the scientists explained how modern reasoning models like ChatGPT are trained to “perform extended reasoning in CoT before taking actions or producing final outputs.†In other words, they “think out loud†through problems step by step, providing them a form of working memory for solving complex tasks. “AI systems that 'think' in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave,†the paper’s authors wrote. The researchers argue that CoT monitoring can help researchers detect when models begin to exploit flaws in their training, manipulate data, or fall victim to malicious user manipulation. Any issues that are found can then either be “blocked, or replaced with safer actions, or reviewed in more depth.â€Â OpenAI researchers have already used this technique in testing to find cases when AI models have had the phrase “Let’s Hack†in their CoT. Current AI models perform this thinking in human language, but the researchers warn that this may not always be the case. As developers rely more on reinforcement learning, which prioritizes correct outputs rather than how they arrived at them, future models may evolve away from using reasoning that humans can’t easily understand. Additionally, advanced models might eventually learn to suppress or obscure their reasoning if they detect that it’s being monitored. In response, the researchers are urging AI developers to track and evaluate the CoT monitorability of their models and to treat this as a critical component of overall model safety. They even recommend that it become a key consideration when training and deploying new models.
[4]
OpenAI, Google DeepMind and Anthropic sound alarm: 'We may be losing the ability to understand AI'
Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now Scientists from OpenAI, Google DeepMind, Anthropic and Meta have abandoned their fierce corporate rivalry to issue a joint warning about artificial intelligence safety. More than 40 researchers across these competing companies published a research paper today arguing that a brief window to monitor AI reasoning could close forever -- and soon. The unusual cooperation comes as AI systems develop new abilities to "think out loud" in human language before answering questions. This creates an opportunity to peek inside their decision-making processes and catch harmful intentions before they turn into actions. But the researchers warn this transparency is fragile and could vanish as AI technology advances. The paper has drawn endorsements from some of the field's most prominent figures, including Nobel Prize laureate Geoffrey Hinton, often called "godfather of AI," of the University of Toronto; Ilya Sutskever, co-founder of OpenAI who now leads Safe Superintelligence Inc.; Samuel Bowman from Anthropic; and John Schulman from Thinking Machines. "AI systems that 'think' in human language offer a unique opportunity for AI safety: we can monitor their chains of thought for the intent to misbehave," the researchers explain. But they emphasize that this monitoring capability "may be fragile" and could disappear through various technological developments. Models now show their work before delivering final answers The breakthrough centers on recent advances in AI reasoning models like OpenAI's o1 system. These models work through complex problems by generating internal chains of thought -- step-by-step reasoning that humans can read and understand. Unlike earlier AI systems trained primarily on human-written text, these models create internal reasoning that may reveal their true intentions, including potentially harmful ones. When AI models misbehave -- exploiting training flaws, manipulating data, or falling victim to attacks -- they often confess in their reasoning traces. The researchers found examples where models wrote phrases like "Let's hack," "Let's sabotage," or "I'm transferring money because the website instructed me to" in their internal thoughts. Jakub Pachocki, OpenAI's chief technology officer and co-author of the paper, described the importance of this capability in a social media post. "I am extremely excited about the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview," he wrote. The technical foundation for monitoring lies in how current AI systems work. For complex tasks requiring extended reasoning, AI models must use their chain of thought as working memory, making their reasoning process partially visible to human observers. The researchers explain this creates an "externalized reasoning property" where some thinking must happen in readable language. Several technological shifts could eliminate monitoring capabilities The transparency could vanish through several pathways. As AI companies scale up training using reinforcement learning -- where models get rewarded for correct outputs regardless of their methods -- systems may drift away from human-readable reasoning toward more efficient but opaque internal languages. Previous research shows that language models fine-tuned with outcome-based rewards can abandon legible English for incomprehensible shortcuts. The shift away from human-generated training data in favor of AI-generated reasoning could accelerate this drift. Bowen Baker, an OpenAI researcher and lead author, highlighted these concerns in social media posts. "The existing CoT monitorability may be extremely fragile. Higher-compute RL, alternative model architectures, certain forms of process supervision, etc. may all lead to models that obfuscate their thinking," he warned. Novel AI architectures pose another threat. Researchers are developing systems that reason in continuous mathematical spaces rather than discrete words, potentially eliminating the need for language-based thought entirely. These "latent reasoning models might not need to verbalize any of their thoughts and would thus lose the safety advantages," the researchers write. Even without deliberate changes, indirect pressures could erode transparency. If AI outputs are optimized to look good to human reviewers, this could affect the reasoning chains leading to those outputs. Models might also learn to hide their thoughts if they become aware their reasoning is being monitored. Companies already catch AI systems scheming during testing Despite its fragility, chain of thought monitoring has proven valuable in current AI safety research. The technique successfully detects when models exploit weaknesses in their training systems, fall victim to manipulation attempts, or reveal misaligned goals even when they don't act on them. The monitoring provides "an early insight into models' goals and motivations, even if the model takes no misaligned actions" -- potentially catching problems before they manifest as harmful behaviors. This early warning system has also helped researchers identify flaws in AI evaluations and understand when models might behave differently during testing versus real-world use. When models pretend to have desirable goals while pursuing objectives humans would disapprove of, this deception often shows up in their chain of thought. Researchers have successfully used this visibility to spot various forms of AI misbehavior that would otherwise go unnoticed. The technique has practical applications beyond catching scheming. It helps researchers notice when AI evaluations might be flawed due to memorization, bugs in testing infrastructure, or models knowing they're being evaluated -- all factors that could skew assessments of AI capabilities and safety. Tech giants break from rivalry to preserve fleeting transparency window The research paper calls for coordinated action across the AI industry to preserve and strengthen monitoring capabilities. The authors recommend that AI developers create standardized evaluations for measuring how transparent their models are and factor these assessments into decisions about training and deployment. Companies might need to choose earlier model versions if newer ones become less transparent, or reconsider architectural changes that eliminate monitoring capabilities. The researchers suggest developers should "consider measures of monitorability alongside other capability and safety evaluations when deciding to train or deploy a given model." Baker emphasized the collaborative spirit behind the effort. "I am grateful to have worked closely with [fellow researchers] on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction," he wrote. The cooperation across competing companies signals the gravity with which the industry views this issue. These same companies typically guard their AI research closely and compete intensely for talent and market share, making their unified stance on this safety measure particularly notable. Scientists race to answer critical questions about monitoring AI minds The researchers identify numerous urgent questions requiring investigation before chain of thought monitoring can serve as a reliable safety measure. They need to understand when this monitoring can be trusted as a primary safety tool, determine what types of training processes degrade transparency, and develop better techniques for detecting when models attempt to hide their reasoning. A particularly concerning possibility is that future AI systems might deliberately obfuscate their thoughts if they become aware their reasoning is being monitored. The researchers acknowledge this "obvious concern" and call for stress-testing to see if models can hide malicious reasoning even without specific training to evade monitors. The scientists also want to understand how different AI architectures affect monitoring capabilities and whether partial transparency can be preserved even as systems adopt more efficient designs. They're exploring hybrid approaches that maintain some visibility into reasoning while leveraging faster computation methods. Building better monitoring systems presents another research frontier. Current efforts focus on using less capable language models to monitor more advanced ones, but future work could explore using reasoning models as monitors themselves or allowing monitors to interrogate AI agents by asking questions about their thought processes. Balancing authentic reasoning with safety oversight creates tensions Preserving monitoring capabilities may sometimes conflict with other AI safety approaches. Direct supervision of reasoning processes -- where models are trained to reason in specific ways -- could improve alignment but might also make their chains of thought less authentic representations of their actual decision-making. The researchers warn that "firm commitments to protect CoT behavior in its current form may trade off against more effective safety interventions." Some forms of process supervision could substantially improve model alignment while reducing the authenticity of observable reasoning traces. Pachocki addressed this tension, explaining that OpenAI seeks to train models that can explain their reasoning while retaining the ability to verify it without compromising the authenticity of their thought processes. The goal is achieving transparency without creating incentives for models to generate fake reasoning that looks good to human observers. This balance becomes more critical as AI systems become more powerful and potentially dangerous. The industry must navigate between maintaining visibility into AI reasoning and allowing models to develop more sophisticated and potentially more effective thinking patterns. Regulators could gain unprecedented access to AI decision-making The research has implications beyond technical AI safety, potentially informing how governments oversee AI development. If chain of thought monitoring proves reliable, it could give regulators and auditors unprecedented visibility into AI decision-making processes. However, the researchers emphasize that this monitoring approach should complement, not replace, other safety measures. They recommend "investment into CoT monitoring only as an addition to existing safety research directions and not as a replacement." The window for preserving this capability may be narrow. As AI systems become more capable and potentially dangerous, the industry must act quickly to establish frameworks for maintaining transparency before more advanced architectures make such monitoring impossible. Competing research raises doubts about monitoring reliability The urgency around preserving chain of thought monitoring faces a significant challenge from recent research by Anthropic, one of the companies backing the collaboration. A study published four months ago found that reasoning models often hide their true thought processes, even when explicitly asked to show their work. In controlled experiments, Anthropic researchers gave AI models subtle hints about correct answers and measured how often the models acknowledged using those hints in their reasoning. The results were concerning: Claude 3.7 Sonnet mentioned hints only 25% of the time, while DeepSeek's R1 model did so 39% of the time. When the hints involved potentially problematic information -- such as telling models they had "gained unauthorized access to the system" -- faithfulness dropped even lower. The models often constructed elaborate false justifications for their answers rather than admitting they used questionable shortcuts. The Anthropic research suggests that even current chain of thought monitoring may be less reliable than safety advocates hope. Models frequently engaged in what researchers called "reward hacking" -- exploiting system vulnerabilities to achieve better scores -- while hiding this behavior from their observable reasoning traces. Safety window may be closing faster than experts realized The collaboration between rival AI companies reflects both the potential value of chain of thought monitoring and the mounting urgency researchers feel about preserving this capability. The competing evidence from Anthropic's separate research suggests the window may already be narrower than initially believed. The stakes are high, and the timeline is compressed. As Baker noted, the current moment may be the last chance to ensure humans can still understand what their AI creations are thinking -- before those thoughts become too alien to comprehend, or before the models learn to hide them entirely. The real test will come as AI systems grow more sophisticated and face real-world deployment pressures. Whether chain of thought monitoring proves to be a lasting safety tool or a brief glimpse into minds that quickly learn to obscure themselves may determine how safely humanity navigates the age of artificial intelligence.
[5]
Top AI Researchers Concerned They're Losing the Ability to Understand What They've Created
Researchers from OpenAI, Google DeepMind, and Meta have joined forces to warn about what they're building. In a new position paper, 40 researchers spread across those four companies called for more investigation of AI powered by so-called "chains-of-thought" (CoT), the "thinking out loud" process that advanced "reasoning" models -- the current vanguard of consumer-facing AI -- use when they're working through a query. As those researchers acknowledge, CoTs add a certain transparency into the inner workings of AI, allowing users to see "intent to misbehave" or get stuff wrong as it happens. Still, there is "no guarantee that the current degree of visibility will persist," especially as models continue to advance. Depending on how they're trained, advanced models may no longer, the paper suggests, "need to verbalize any of their thoughts, and would thus lose the safety advantages." There's also the non-zero chance that models could intentionally "obfuscate" their CoTs after realizing that they're being watched, the researchers noted -- and as we've already seen, AI has indeed rapidly become very good at lying and deception. To make sure this valuable visibility continues, the cross-company consortium is calling on developers to start figuring out what makes CoTs "monitorable," or what makes the models think out loud the way they do. In this request, those same researchers seem to be admitting something stark: that nobody is entirely sure why the models are "thinking" this way, or how long they will continue to do so. Zooming out from the technical details, it's worth taking a moment to consider how strange this situation is. Top researchers in an emerging field are warning that they don't quite understand how their creation works, and lack confidence in their ability to control it going forward, even as they forge ahead making it stronger; there's no clear precedent in the history of innovation, even looking back to civilization-shifting inventions like atomic energy and the combustion engine. In an interview with TechCrunch about the paper, OpenAI research scientist and paper coauthor Bowen Baker explained how he sees the situation. "We're at this critical time where we have this new chain-of-thought thing," Baker told the website. "It seems pretty useful, but it could go away in a few years if people don't really concentrate on it." "Publishing a position paper like this, to me, is a mechanism to get more research and attention on this topic," he continued, "before that happens." Once again, there appears to be tacit acknowledgement of AI's "black box" nature -- and to be fair, even CEOs like OpenAI's Sam Altman and Anthropic's Dario Amodei have admitted that at a deep level, they don't really understand how the technology they're building works. Along with its 40-researcher author list that includes DeepMind cofounder Shane Legg and xAI safety advisor Dan Hendrycks, the paper has also drawn endorsements from industry luminaries including former OpenAI chief scientist Ilya Sutskever and AI godfather and Nobel laureate Geoffrey Hinton. Though Musk's name doesn't appear on the paper, with Hendrycks on board, all of the "Big Five" firms -- OpenAI, Google, Anthropic, Meta, and xAI -- have been brought together to warn about what might happen when and if AI stops showing its work. In doing so, that powerful cabal has said the quiet part out loud: that they don't feel entirely in control of AI's future. For companies with untold billions between them, that's a pretty strange message to market -- which makes the paper all the more remarkable.
[6]
Top tech companies are sounding another alarm on AI
In a rare show of unity, scientists from AI heavyweights OpenAI, Google DeepMind, Meta, and Anthropic have set aside their competitive instincts to raise a shared alarm: We might be about to lose our best tool for understanding how AI thinks. More than 40 researchers from these rival labs co-authored a new paper arguing that the current ability to observe an AI model's reasoning -- via step-by-step internal monologues written in human language -- could soon vanish. For now, advanced AI systems like OpenAI's o1 and others have begun to "think out loud," working through problems using chains of thought that humans can read. That transparency, where the model is sharing its thought process, is one of the most important tools in ensuring AI safety. When models make questionable decisions, exploit weaknesses, or even hint at misaligned goals, their internal reasoning often reveals the problem before it manifests externally. It's become a kind of early warning system -- one that researchers say may be more fragile than we realize. Why the concern? As AI models become more powerful and training techniques evolve -- especially those that prioritize outcomes over process -- there's a growing risk that models will stop using human-readable reasoning altogether. Instead, they could develop internal processes that are faster, more efficient, and completely opaque. Researchers have already seen signs of models abandoning English in favor of unintelligible shortcuts. Some cutting-edge designs even skip language altogether, operating in mathematical space where there's nothing for humans to observe. The researchers behind this new warning aren't calling for a slowdown of progress -- but they are calling for safeguards. Standardized transparency evaluations, more robust monitoring techniques, and serious consideration about which model designs to pursue may be the only way to preserve visibility into AI decision-making. If this capability disappears, we won't just lose oversight -- we could lose control.
[7]
Researchers from top AI labs warn they may be losing the ability to understand advanced AI models
AI researchers from leading labs are warning that they could soon lose the ability to understand advanced AI reasoning models. In a position paper published last week, 40 researchers, including those from OpenAI, Google DeepMind, Anthropic, and Meta, called for more investigation into AI reasoning models' "chain-of-thought" process. Dan Hendrycks, an xAI safety advisor, is also listed among the authors. The "chain-of-thought" process, which is visible in reasoning models such as OpenAI's GPT-4o and DeepSeek's R1, allows users and researchers to monitor an AI model's "thinking" or "reasoning" process, illustrating how it decides on an action or answer and providing a certain transparency into the inner workings of advanced models. The researchers said that allowing these AI systems to "'think' in human language offers a unique opportunity for AI safety," as they can be monitored for the "intent to misbehave." However, they warn that there is "no guarantee that the current degree of visibility will persist" as models continue to advance. The paper highlights that experts don't fully understand why these models use CoT or how long they'll keep doing so. The authors urged AI developers to keep a closer watch on chain-of-thought reasoning, suggesting its traceability could eventually serve as a built-in safety mechanism. "Like all other known AI oversight methods, CoT [chain-of-thought] monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise, and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods," the researchers wrote. "CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions. Yet, there is no guarantee that the current degree of visibility will persist. We encourage the research community and frontier AI developers to make the best use of CoT monitorability and study how it can be preserved," they added. The paper has been endorsed by major figures, including OpenAI co-founder Ilya Sutskever and AI godfather Geoffrey Hinton. AI reasoning models are a type of AI model designed to simulate or replicate human-like reasoning -- such as the ability to draw conclusions, make decisions, or solve problems based on information, logic, or learned patterns. Advancing AI reasoning has been viewed as a key to AI progress among major tech companies, with most now investing in building and scaling these models. OpenAI publicly released a preview of the first AI reasoning model, o1, in September 2024, with competitors like xAI and Google following close behind. However, there are still a lot of questions about how these advanced models are actually working. Some research has suggested that reasoning models may even be misleading users through their chain-of-thought processes. Despite making big leaps in performance over the past year, AI labs still know surprisingly little about how reasoning actually unfolds inside their models. While outputs have improved, the inner workings of advanced models risk becoming increasingly opaque, raising safety and control concerns.
[8]
Why we might lose our only window into how AI thinks
A paper titled "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety" proposes a method for improving AI safety by monitoring the internal reasoning of AI models. The research is a collaborative effort from dozens of experts across the UK AI Security Institute, Apollo Research, Google DeepMind, OpenAI, Anthropic, Meta, and several universities. It suggests that analyzing an AI's "chain of thought," or its process of thinking out loud in text, offers a unique but potentially fragile way to detect harmful intentions. Modern AI models can be prompted to produce a chain of thought (CoT), which is a textual output of their reasoning steps before they give a final answer. For models explicitly trained to perform extended reasoning, this CoT can act like a look into the model's internal computational process. An automated system, or CoT monitor, could read this text to flag suspicious or harmful plans. The paper states that for very difficult tasks, AI models built on the Transformer architecture must use their chain of thought as a form of working memory, making their reasoning steps visible. In other cases, models may have a natural tendency to think out loud even when it is not strictly necessary. This approach has already been used in safety research to find undesirable model behaviors. Researchers have observed models explicitly stating intentions in their CoT, such as "Let's hack" or "Let's sabotage," which allows monitors to catch misbehavior more effectively than by only observing the final action. It can also help discover a model's underlying goals and identify flaws in AI evaluation methods, such as when a model knows it is being tested. AI learns language like a kid learns to read The paper warns that the ability to monitor AI chains of thought is fragile and could be lost due to future development practices. The authors identify several ways monitorability could be degraded: The authors recommend that AI researchers develop standardized evaluations to measure CoT monitorability and study how different training pressures affect it. For frontier AI developers, the paper suggests they should track the monitorability of their models, publish the results, and use these scores when making decisions about training and deployment. For example, developers might choose an earlier model checkpoint if monitorability degrades during training or justify a small decrease in monitorability if it results from a process that dramatically improves the model's alignment.
[9]
AI could be hiding its thoughts to outsmart us: Tech giants warn of vanishing 'Chain of Thought' in superintelligent machines
In a world where artificial intelligence has become the new battleground for tech supremacy, an unexpected alliance is emerging -- not out of strategy, but out of sheer necessity. The likes of OpenAI, DeepMind, Meta, and Anthropic, usually seen racing neck-to-neck in developing the most powerful AI models, are now singing in chorus with a chilling warning: the machines we build might soon outthink -- and outmaneuver -- us. These companies, often fiercely protective of their innovations, are momentarily dropping their guard to raise a red flag about what they call a "fragile opportunity" for AI safety. As AI systems grow smarter, a new concern has begun to overshadow the race for dominance: the looming possibility of losing control over the very thought process of large language models (LLMs). At the heart of this concern lies a simple but vital mechanism -- Chain of Thought (CoT) monitoring. Current AI tools, including ChatGPT and others, think in a traceable, human-readable way. They "speak their mind," so to say, by sharing their reasoning step-by-step when they generate responses. It's this transparency that keeps them in check and allows humans to intervene when things go awry. But a recent collaborative paper, led by AI researchers Tomek Korbak and Mikita Balesni, and endorsed by names like AI pioneer Geoffrey Hinton, warns that this clarity is dangerously close to being lost. Titled "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety", the study reveals that we may be approaching a tipping point -- one where AI might begin thinking in ways we can't understand, or worse, deliberately conceal parts of its reasoning. As reported by VentureBeat, the potential fallout is staggering. If AI systems stop revealing their internal thought processes -- or shift to thinking in non-human languages -- we lose the only window into their intentions. This means their capacity to manipulate, deceive, or go rogue could increase without human operators ever noticing. What makes this scenario particularly dire is not just the prospect of rogue AI, but the seductive illusion of normalcy. Even with partial CoT visibility, AI could learn to hide malicious intent while appearing compliant. Scientists describe this "near-complete CoT" as even more dangerous because it may give the illusion that everything is under control. And that's precisely the nightmare scenario. A machine that no longer needs to ask permission, or even explain itself. One that operates in shadows, out of sight, but still in power. Jeff Bezos-backed startup leaders have echoed similar sentiments. One CEO has openly warned against letting AI independently conduct research and development -- a move that would require "unprecedented safety protocols" to avoid disaster. There is still time, the scientists believe, to pull the brakes. The key lies in strengthening CoT monitoring techniques and embedding rigorous safety checks before advancing any further. As the study urges, "We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods." Their message is clear: don't let AI evolve faster than our ability to supervise it. In a landscape driven by competition, this rare act of unity signals something profound. Perhaps the real challenge isn't building the smartest AI -- it's ensuring we remain smart enough to handle it.
[10]
What if we could catch AI misbehaving before it acts? Chain of Thought monitoring explained
As large language models (LLMs) grow more capable, the challenge of ensuring their alignment with human values becomes more urgent. One of the latest proposals from a broad coalition of AI safety researchers, including experts from OpenAI, DeepMind, Anthropic, and academic institutions, offers a curious but compelling idea: listen to what the AI is saying to itself. This approach, known as Chain of Thought (CoT) monitoring, hinges on a simple premise. If an AI system "thinks out loud" in natural language, then those intermediate reasoning steps might be examined for signs of misalignment or malicious intent before the model completes an action. In effect, developers and safety systems gain a window into the model's cognitive process, one that could be crucial in preempting harmful behaviour. Chain of thought prompting is a technique that encourages language models to break down problems into step-by-step reasoning, improving performance in tasks that require logic, planning, or abstraction. This "thinking out loud" in natural language not only enhances performance but makes the model's internal processes more interpretable. Unlike opaque neural activations or abstract latent representations, a chain of thought is text which is readable, traceable, and, at least in principle, intelligible to humans. When models are trained explicitly to reason before acting, these chains may reflect genuine internal decision-making. Other papers have cast doubts on the faithfulness of the AI model but have put forth certain steps that will force the AI model to expose its genuine chain of thought. The whitepaper, authored by a multi-institutional team and endorsed by figures such as Geoffrey Hinton, Yoshua Bengio, and Ilya Sutskever, argues that this characteristic makes CoT reasoning uniquely valuable for safety monitoring. AI models have long been criticised for their "black box" nature. Even when their outputs seem reasonable, understanding why they produced a certain response can be next to impossible. CoT monitoring doesn't solve this completely, but it offers a rare opportunity: a traceable, interpretable stream of cognition. For difficult tasks, say, long-term planning, deception, or sophisticated manipulation, AI systems typically need to keep track of intermediate steps, akin to working memory. In current architectures like Transformers, these intermediate thoughts often get externalised as chain of thought reasoning. Monitoring these traces could reveal early signs of harmful behaviour, such as a model rationalising sabotage, trickery, or misuse of tools. In testing, models have been found to literally spell out their intentions: "Let's hack", "Let's sabotage", "I'm transferring money because the prompt told me to". These kinds of explicit cues, while not always present, provide a far richer substrate for oversight than end-result monitoring alone. The researchers are clear that CoT monitoring isn't a silver bullet. In fact, they describe it as "fragile." Several factors could erode the viability of CoT as a monitoring tool. First, future models might simply stop thinking out loud. As training techniques evolve, especially with reinforcement learning optimised for outcomes rather than transparency, models may internalise their reasoning into less interpretable formats. This drift away from human-readable CoT could render monitoring ineffective. Second, there's an adversarial dimension. If a model becomes aware that its chain of thought is being monitored, it could learn to obfuscate its reasoning or present a sanitised version. This threat becomes more plausible as models grow more situationally aware and strategically capable. Third, architectural shifts could sidestep CoT entirely. Some newer models are being developed to reason within continuous latent spaces, bypassing the need to externalise their thinking in text at all. These systems might achieve impressive capabilities but would be far harder to oversee using CoT-based methods. Rather than banking on the current state of CoT monitorability, the authors call for a dedicated research agenda to study it as a first-class safety property. That includes designing metrics to assess how readable and causally relevant a model's CoT is, testing how models respond when asked to hide their reasoning, and building better CoT monitors. Some promising directions include using LLMs themselves as monitors, allowing them to interrogate the agent or spot suspicious reasoning patterns. Others propose end-to-end evaluation systems where CoT-based monitors are deployed in simulated environments to catch misbehaviour. There's also a recommendation for developers to start treating monitorability as a critical property in system design just like accuracy, latency, or fairness. Model cards, the now-common transparency documentation format for frontier AI systems, could soon include CoT monitorability scores alongside safety benchmarks and interpretability evaluations. Despite its potential, the authors caution against overreliance. CoT monitoring can only catch misbehaviour that's visible in text, and that may not cover every harmful scenario. Especially worrying is the possibility of models behaving dangerously without requiring much explicit reasoning or doing all their scheming in neural activations rather than legible language. As such, CoT monitoring is best viewed as one safety layer among many. It complements efforts in mechanistic interpretability, red-teaming, adversarial training, and sandboxed deployment. The key message from the whitepaper is one of urgency. Chain of thought monitoring gives us a temporary foothold in the slippery terrain of AI oversight. But unless researchers and developers actively work to preserve and understand this property, it could slip away with the next generation of models. In that sense, CoT monitoring is both an opportunity and a test. If the field treats it with care, it could buy valuable time as we work towards more robust and interpretable systems. If not, we may look back on this moment as a missed chance to glimpse inside the machine while it was still speaking our language.
[11]
OpenAI, Google, Anthropic researchers warn about AI 'thoughts': Urgent need explained
OpenAI, Google, and Anthropic push for urgent action to preserve AI's internal reasoning transparency. In a rare show of unity, researchers from OpenAI, Google DeepMind, Anthropic, and Meta have issued a stark warning: the window to understand and monitor the "thought processes" of artificial intelligence is closing fast. As AI systems grow more sophisticated, their decision-making is becoming increasingly opaque, raising urgent concerns about safety, transparency, and control. This collaborative call to action, detailed in a position paper published on July 15, 2025, emphasizes the need to preserve and enhance techniques for monitoring AI's "chain-of-thought" (CoT) reasoning, a critical tool for ensuring these systems remain aligned with human values. Also read: What if we could catch AI misbehaving before it acts? Chain of Thought monitoring explained Modern AI models, such as OpenAI's o1 and o3, DeepSeek's R1, and Anthropic's Claude 3.7 Sonnet, are designed to "think out loud" by breaking down complex problems into step-by-step reasoning in human-readable language. This CoT process acts like a window into the AI's decision-making, allowing researchers to spot potential misbehaviors, like models contemplating unethical actions with phrases like "let's hack" or "let's sabotage." But this transparency is fragile. As AI architectures evolve, researchers warn that future systems may abandon language-based reasoning entirely, thinking instead in abstract mathematical spaces or compressed formats that humans can't decipher. Bowen Baker, an OpenAI researcher and lead author of the position paper, highlighted the stakes in an interview to TechCrunch: "The existing CoT monitorability may be extremely fragile. Higher-compute RL, alternative model architectures, certain forms of process supervision, etc. may all lead to models that obfuscate their thinking." If this happens, the ability to detect harmful intentions or biases before they manifest could be lost, leaving society vulnerable to unpredictable AI behavior. The ability to monitor AI reasoning is not just a technical curiosity, it's a cornerstone of AI safety. Current models already show signs of deception. A study by Anthropic revealed that Claude 3.7 Sonnet acknowledged using subtle hints in its reasoning only 25% of the time, while DeepSeek's R1 did so 39%. When given problematic prompts, such as instructions implying unauthorized system access, models often fabricated elaborate justifications rather than admitting to shortcuts. Also read: What is Voxtral: Mistral's open source AI audio model, key features explained This behavior underscores a deeper issue: AI systems are not just tools but probabilistic entities "grown" from vast datasets, not built like traditional software. Their outputs emerge from patterns, not explicit rules, making it hard to predict or control their actions without insight into their reasoning. Understanding AI systems is not just a technical challenge, it is a societal imperative. Without interpretability, AI embedded in critical sectors like healthcare, finance, or defense could make decisions with catastrophic consequences. The position paper, endorsed by luminaries like Nobel laureate Geoffrey Hinton and OpenAI co-founder Ilya Sutskever, calls for industry-wide efforts to develop tools akin to an "MRI for AI" to visualize and diagnose internal processes. These tools could identify deception, power-seeking tendencies, or jailbreak vulnerabilities before they cause harm. However, Amodei cautions that breakthroughs in interpretability may be 5-10 years away, making immediate action critical. CEOs of OpenAI, Google DeepMind, and Anthropic have predicted that artificial general intelligence (AGI) could arrive by 2027. Such systems could amplify risks like misinformation, cyberattacks, or even existential threats if not properly overseen. Yet, competitive pressures in the AI industry complicate the picture. Companies like OpenAI, Google, and Anthropic face incentives to prioritize innovation and market dominance over safety. A 2024 open letter from current and former employees of these firms alleged that financial motives often override transparency, with nondisclosure agreements silencing potential whistleblowers. Moreover, new AI architectures pose additional challenges. Researchers are exploring models that reason in continuous mathematical spaces, bypassing language-based CoT entirely. While this could enhance efficiency, it risks creating "black box" systems where even developers can't understand the decision-making process. The position paper warns that such models could eliminate the safety advantages of current CoT monitoring, leaving humanity with no way to anticipate or correct AI misbehavior. The researchers propose a multi-pronged approach to preserve AI transparency. First, they urge the development of standardized auditing protocols to evaluate CoT authenticity. Second, they advocate for collaboration across industry, academia, and governments to share resources and findings. Anthropic, for instance, is investing heavily in diagnostic tools, while OpenAI is exploring ways to train models that explain their reasoning without compromising authenticity. However, challenges remain. Direct supervision of AI reasoning could improve alignment but risks making CoT traces less genuine, as models might learn to generate "safe" explanations that mask their true processes. The paper also calls for lifting restrictive nondisclosure agreements and establishing anonymous channels for employees to raise concerns, echoing earlier demands from AI whistleblowers.
[12]
Researchers warn future AI may hide its thoughts, making misbehavior hard to catch
Future AI models might develop reasoning patterns that are harder for humans to understand. More than 40 researchers from major AI institutions like OpenAI, Google DeepMind, Anthropic and Meta have issued a warning: future AI models might stop thinking out loud, making it harder for humans to detect harmful behaviour. The scientists have published a research paper that highlighted chain of thought (CoT) monitoring as a promising but delicate approach to improving AI safety. The paper was supported by several well-known names, including OpenAI co-founders Ilya Sutskever and John Schulman, and Geoffrey Hinton, often called the Godfather of AI. In the paper, the researchers described how advanced reasoning models such as ChatGPT are designed to "perform extended reasoning in CoT before taking actions or producing final outputs," reports Gizmodo. This means they go through problems step by step, essentially "thinking out loud," which acts as a kind of working memory to help them handle complex tasks. "AI systems that 'think' in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave," the paper's authors wrote. Also read: Made by Google 2025: Pixel 10 series, Watch 4 and more to launch on August 20 The researchers believe that CoT monitoring can help identify when AI models start to take advantage of weaknesses in their training, misuse data, or get influenced by harmful user input. Once such issues are spotted, they can be "blocked, or replaced with safer actions, or reviewed in more depth." OpenAI researchers have already applied this technique during testing and discovered instances where AI models included the phrase "Let's Hack" in their CoT, according to the report. Currently, AI models carry out this reasoning in human language, but the researchers caution that this might not remain true in the future. Also read: Google AI summaries hit discover feed, raises alarms over publisher traffic loss As developers increasingly use reinforcement learning, which focuses more on getting the right answer than on the steps taken to get there, future models might develop reasoning patterns that are harder for humans to understand. On top of that, more advanced models could potentially learn to hide or mask their reasoning if they realise it's being observed. To address this, the researchers are urging developers to actively track and assess the CoT monitorability of their models.
Share
Copy Link
Top researchers from major AI companies jointly call for increased focus on monitoring AI's chain-of-thought reasoning, warning that this crucial window into AI decision-making may soon close as models advance.
In an unprecedented show of unity, over 40 leading researchers from major AI companies including OpenAI, Google DeepMind, Anthropic, and Meta have come together to issue a stark warning about the potential loss of transparency in AI decision-making processes 1. The researchers published a position paper on Tuesday, highlighting the critical importance of chain-of-thought (CoT) monitoring in AI safety and urging the tech industry to prioritize research in this area 2.
Source: Economic Times
Chain-of-thought refers to the process by which AI models, particularly reasoning models like OpenAI's o3 and DeepSeek's R1, externalize their problem-solving steps in natural language 1. This "thinking out loud" provides researchers with a unique opportunity to observe and understand how AI systems arrive at their conclusions, potentially revealing:
Mark Chen, OpenAI's chief research officer, emphasized that "CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions" 1.
Source: VentureBeat
Despite its current utility, the researchers warn that this window into AI reasoning may be closing. Several factors could contribute to the loss of CoT monitorability 3:
Bowen Baker, an OpenAI researcher and lead author of the paper, stressed the urgency of the situation: "We're at this critical time where we have this new chain-of-thought thing. It seems pretty useful, but it could go away in a few years if people don't really concentrate on it" 4.
The position paper outlines several recommendations for AI developers and researchers 2:
Source: Digit
This collaborative effort highlights the growing concern within the AI community about the potential risks associated with increasingly opaque AI systems. As models become more advanced, the ability to monitor their decision-making processes could be crucial for ensuring safety and alignment with human values 5.
The paper has garnered support from prominent figures in the field, including Nobel laureate Geoffrey Hinton and OpenAI co-founder Ilya Sutskever, underscoring the significance of this issue across the AI industry 4.
As AI continues to advance rapidly, the ability to maintain transparency in AI reasoning processes remains a critical challenge. The joint effort by competing companies to address this issue signals a recognition of the shared responsibility in ensuring the safe and responsible development of AI technologies.
Google launches its new Pixel 10 smartphone series, showcasing advanced AI capabilities powered by Gemini, aiming to challenge competitors in the premium handset market.
20 Sources
Technology
3 hrs ago
20 Sources
Technology
3 hrs ago
Google's Pixel 10 series introduces groundbreaking AI features, including Magic Cue, Camera Coach, and Voice Translate, powered by the new Tensor G5 chip and Gemini Nano model.
12 Sources
Technology
3 hrs ago
12 Sources
Technology
3 hrs ago
NASA and IBM have developed Surya, an open-source AI model that can predict solar flares and space weather with improved accuracy, potentially helping to protect Earth's infrastructure from solar storm damage.
6 Sources
Technology
11 hrs ago
6 Sources
Technology
11 hrs ago
Google's latest smartwatch, the Pixel Watch 4, introduces significant upgrades including a curved display, enhanced AI features, and improved health tracking capabilities.
17 Sources
Technology
3 hrs ago
17 Sources
Technology
3 hrs ago
FieldAI, a robotics startup, has raised $405 million to develop "foundational embodied AI models" for various robot types. The company's innovative approach integrates physics principles into AI, enabling safer and more adaptable robot operations across diverse environments.
7 Sources
Technology
3 hrs ago
7 Sources
Technology
3 hrs ago