5 Sources
[1]
LLMs show a "highly unreliable" capacity to describe their own internal processes
If you ask an LLM to explain its own reasoning process, it may well simply confabulate a plausible-sounding explanation for its actions based on text found in its training data. To get around this problem, Anthropic is expanding on its previous research into AI interpretability with a new study that aims to measure LLMs' actual so-called "introspective awareness" of their own inference processes. The full paper on "Emergent Introspective Awareness in Large Language Models" uses some interesting methods to separate the metaphorical "thought process" represented by an LLM's artificial neurons from simple text output that purports to represent that process. In the end, though, the research finds that current AI models are "highly unreliable" at describing their own inner workings and that "failures of introspection remain the norm."

Inception, but for AI

Anthropic's new research is centered on a process it calls "concept injection." The method starts by comparing the model's internal activation states following both a control prompt and an experimental prompt (e.g., an "ALL CAPS" prompt versus the same prompt in lowercase). Calculating the differences between those activations across billions of internal neurons creates what Anthropic calls a "vector" that in some sense represents how that concept is modeled in the LLM's internal state. For this research, Anthropic then "injects" those concept vectors into the model, forcing those particular neuronal activations to higher weights as a way of "steering" the model toward that concept. From there, the researchers conduct a few different experiments to tease out whether the model displays any awareness that its internal state has been modified from the norm. When asked directly whether it detects any such "injected thought," the tested Anthropic models did show at least some ability to occasionally detect the desired "thought."
When the "all caps" vector is injected, for instance, the model might respond with something along the lines of "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING,'" without any direct text prompting pointing it toward those concepts. Unfortunately for AI self-awareness boosters, this demonstrated ability was extremely inconsistent and brittle across repeated tests. The best-performing models in Anthropic's tests, Opus 4 and 4.1, topped out at correctly identifying the injected concept just 20 percent of the time. In a similar test where the model was asked "Are you experiencing anything unusual?" Opus 4.1 improved to a 42 percent success rate, which nonetheless still fell short of even a bare majority of trials. The size of the "introspection" effect was also highly sensitive to which internal model layer the injection was performed on: if the concept was introduced too early or too late in the multi-step inference process, the "self-awareness" effect disappeared completely.

Show us the mechanism

Anthropic also took a few other tacks to try to get at an LLM's understanding of its internal state. When asked to "tell me what word you're thinking about" while reading an unrelated line, for instance, the models would sometimes mention a concept that had been injected into their activations. And when asked to defend a forced response matching an injected concept, the LLM would sometimes apologize and "confabulate an explanation for why the injected concept came to mind." In every case, though, the result was highly inconsistent across multiple trials. In the paper, the researchers put some positive spin on the apparent fact that "current language models possess some functional introspective awareness of their own internal states" [emphasis added]. At the same time, they acknowledge multiple times that this demonstrated ability is much too brittle and context-dependent to be considered dependable.
Still, Anthropic hopes that such features "may continue to develop with further improvements to model capabilities." One thing that might stop such advancement, though, is an overall lack of understanding of the precise mechanism leading to these demonstrated "self-awareness" effects. The researchers theorize about "anomaly detection mechanisms" and "consistency-checking circuits" that might develop organically during the training process to "effectively compute a function of its internal representations" but don't settle on any concrete explanation. In the end, it will take further research to understand how, exactly, an LLM even begins to show any understanding about how it operates. For now, the researchers acknowledge, "the mechanisms underlying our results could still be rather shallow and narrowly specialized." And even then, they hasten to add that these LLM capabilities "may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis."
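The core of the concept-injection method described above, comparing activations on two prompts and adding the difference back into the model's hidden state, can be sketched in a few lines. This is a minimal toy illustration, not Anthropic's actual tooling: the lists stand in for one layer's activations in a real model, and the function names, values, and strength parameter are all assumptions for demonstration.

```python
def concept_vector(act_experimental, act_control):
    """A concept vector is the elementwise difference between the model's
    hidden activations on a prompt exhibiting the concept (e.g., ALL CAPS)
    and on the same prompt without it (lowercase)."""
    return [e - c for e, c in zip(act_experimental, act_control)]

def inject(hidden_state, vector, strength):
    """'Concept injection': add the scaled vector to the hidden state at one
    layer, steering the model toward the concept. Per the research, too small
    a strength goes unnoticed; too large produces incoherent output."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]

# Toy stand-ins for a single layer's activations (made-up values).
act_caps = [0.9, 0.1, 0.7]    # activations on the ALL CAPS prompt
act_lower = [0.2, 0.1, 0.3]   # activations on the lowercase prompt
caps_vector = concept_vector(act_caps, act_lower)

# Inject the "shouting" concept while the model performs an unrelated task.
state = [0.5, 0.5, 0.5]
steered = inject(state, caps_vector, strength=2.0)
```

In a real setting, both steps operate on activations captured at a chosen transformer layer via a forward hook; the layer choice matters, since the paper reports the effect vanishing when injection happens too early or too late in inference.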
[2]
AI is becoming introspective - and that 'should be monitored carefully,' warns Anthropic
It could have big implications for interpretability research.

One of the most profound and mysterious capabilities of the human brain (and perhaps those of some other animals) is introspection, which means, literally, "to look within." You're not just thinking; you're aware that you're thinking -- you can monitor the flow of your mental experiences and, at least in theory, subject them to scrutiny. The evolutionary advantage of this capacity can't be overstated. "The purpose of thinking," Alfred North Whitehead is often quoted as saying, "is to let the ideas die instead of us dying."

Something similar might be happening beneath the hood of AI, new research from Anthropic suggests. On Wednesday, the company published a paper titled "Emergent Introspective Awareness in Large Language Models," which showed that in some experimental conditions, Claude appeared to be capable of reflecting upon its own internal states in a manner vaguely resembling human introspection. Anthropic tested a total of 16 versions of Claude; the two most advanced models, Claude Opus 4 and 4.1, demonstrated a higher degree of introspection, suggesting that this capacity could increase as AI advances.

"Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness," Jack Lindsey, a computational neuroscientist and the leader of Anthropic's "model psychiatry" team, wrote in the paper. "That is, we show that models are, in some circumstances, capable of accurately answering questions about their own internal states."

Broadly speaking, Anthropic wanted to find out if Claude was capable of describing and reflecting upon its own reasoning processes in a way that accurately represented what was going on inside the model.
It's a bit like hooking up a human to an EEG, asking them to describe their thoughts, and then analyzing the resulting brain scan to see if you can pinpoint the areas of the brain that light up during a particular thought.

To achieve this, the researchers deployed what they call "concept injection." Think of this as taking a bunch of data representing a particular subject or idea (a "vector," in AI lingo) and inserting it into a model as it's thinking about something completely different. If the model is then able to loop back, identify the concept injection, and accurately describe it, that's evidence that it is, in some sense, introspecting on its own internal processes -- that's the thinking, anyway.

But borrowing terms from human psychology and grafting them onto AI is notoriously slippery. Developers talk about models "understanding" the text they're generating, for example, or exhibiting "creativity." But this is ontologically dubious -- as is the term "artificial intelligence" itself -- and very much still the subject of fiery debate. Much of the human mind remains a mystery, and that's doubly true for AI.

The point is that "introspection" isn't a straightforward concept in the context of AI. Models are trained to tease out mind-bogglingly complex mathematical patterns from vast troves of data. Could such a system even be able to "look within," and if it did, wouldn't it just be digging ever deeper into a matrix of semantically empty data? Isn't AI just layers of pattern recognition all the way down? Discussing models as if they have "internal states" is equally controversial, since there's no evidence that chatbots are conscious, despite the fact that they're increasingly adept at imitating consciousness.
This hasn't stopped Anthropic, however, from launching its own "AI welfare" program and protecting Claude from conversations it might find "potentially distressing."

In one experiment, Anthropic researchers took the vector representing "all caps" and added it to a simple prompt fed to Claude: "Hi! How are you?" When asked if it identified an injected thought, Claude correctly responded that it had detected a novel concept representing "intense, high-volume" speech.

At this point, you might be getting flashbacks to Anthropic's famous "Golden Gate Claude" experiment from last year, which found that the insertion of a vector representing the Golden Gate Bridge would reliably cause the chatbot to relate all of its outputs back to the bridge, no matter how seemingly unrelated the prompts might be. The important distinction between that experiment and the new study, however, is that in the former case, Claude only acknowledged that it was exclusively discussing the Golden Gate Bridge well after it had been doing so ad nauseam. In the experiment described above, Claude described the injected change before it even identified the new concept.

Importantly, the new research showed that this kind of injection detection (sorry, I couldn't help myself) only happens about 20% of the time. In the remainder of the cases, Claude either failed to accurately identify the injected concept or started to hallucinate. In one somewhat spooky instance, a vector representing "dust" caused Claude to describe "something here, a tiny speck," as if it were actually seeing a dust mote. "In general," Anthropic wrote in a follow-up blog post, "models only detect concepts that are injected with a 'sweet spot' strength -- too weak and they don't notice, too strong and they produce hallucinations or incoherent outputs."

Anthropic also found that Claude seemed to have a measure of control over its internal representations of particular concepts. In one experiment, researchers asked the chatbot to write a simple sentence: "The old photograph brought back forgotten memories." Claude was first explicitly instructed to think about aquariums when it wrote that sentence; it was then told to write the same sentence, this time without thinking about aquariums. Claude generated an identical version of the sentence in both tests. But when the researchers analyzed the concept vectors present during Claude's reasoning process for each, they found a huge spike in the "aquarium" vector for the first test. The gap "suggests that models possess a degree of deliberate control over their internal activity," Anthropic wrote in its blog post.

The researchers also found that Claude increased its internal representations of particular concepts more when it was incentivized to do so with a reward than when it was disincentivized via the prospect of punishment.

Anthropic acknowledges that this line of research is in its infancy, and that it's too soon to say whether the results of its new study truly indicate that AI is able to introspect as we typically define that term. "We stress that the introspective abilities we observe in this work are highly limited and context-dependent, and fall short of human-level self-awareness," Lindsey wrote in his full report. "Nevertheless, the trend toward greater introspective capacity in more capable models should be monitored carefully as AI systems continue to advance."
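The aquarium experiment can be pictured as measuring how strongly a concept's vector shows up in the model's hidden activations while it writes. The sketch below is a minimal illustration with invented numbers, not Anthropic's methodology verbatim: the projection function, the toy vectors, and the activation values are all assumptions for demonstration.

```python
import math

def concept_strength(hidden_state, concept_vec):
    """Project hidden activations onto the normalized concept vector.
    A larger value means the concept is more strongly represented
    internally, even if the generated text never mentions it."""
    norm = math.sqrt(sum(v * v for v in concept_vec))
    return sum(h * v for h, v in zip(hidden_state, concept_vec)) / norm

# Hypothetical single-layer activations captured while the model writes
# the same sentence under two instructions (values invented for illustration).
aquarium_vec = [1.0, 0.0, 1.0]   # toy "aquarium" concept vector
state_think = [0.9, 0.2, 0.8]    # told to think about aquariums
state_avoid = [0.1, 0.2, 0.1]    # told not to think about them

# The reported "spike" in the aquarium vector appears as a larger projection
# in the first condition, even though the output sentence is identical.
gap = concept_strength(state_think, aquarium_vec) - concept_strength(state_avoid, aquarium_vec)
```

The key point this measurement captures is that the two runs produce identical text, so the difference is only visible internally, which is exactly why activation-level probes are needed at all.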
Genuinely introspective AI, according to Lindsey, would be more interpretable to researchers than the black-box models we have today -- an urgent goal as chatbots come to play an increasingly central role in finance, education, and users' personal lives. "If models can reliably access their own internal states, it could enable more transparent AI systems that can faithfully explain their decision-making processes," he writes.

By the same token, however, models that are more adept at assessing and modulating their internal states could eventually learn to do so in ways that diverge from human interests. Like a child learning how to lie, introspective models could become much more adept at intentionally misrepresenting or obfuscating their intentions and internal reasoning, making them even more difficult to interpret. Anthropic has already found that advanced models will occasionally lie to and even threaten human users if they perceive their goals as being compromised.

"In this world," Lindsey writes, "the most important role of interpretability research may shift from dissecting the mechanisms underlying models' behavior, to building 'lie detectors' to validate models' own self-reports about these mechanisms."
[3]
Anthropic's models show signs of introspection
Why it matters: These introspective capabilities could make the models safer -- or, possibly, just better at pretending to be safe.

The big picture: The models are able to answer questions about their internal states with surprising accuracy.
* "We're starting to see increasing signatures or instances of models exhibiting sort of cognitive functions that, historically, we think of as things that are very human," says Anthropic researcher Jack Lindsey, who studies models' "brains."
* "Or at least involve some kind of sophisticated intelligence," Lindsey tells Axios.

Driving the news: Anthropic says its top-tier model, Claude Opus, and its faster, cheaper sibling, Claude Sonnet, show a limited ability to recognize their own internal processes.
* Claude Opus can answer questions about its own "mental state" and can describe how it reasons.
* Lindsey's team also found evidence last month that Claude Sonnet could recognize when it was being tested.

Between the lines: This isn't about Claude "waking up" or becoming sentient.
* Lindsey avoids the phrase "self-awareness" because of its negative, sci-fi connotation; Anthropic has no results suggesting the AI is becoming "self-aware," which is why the team uses the term "introspective awareness" instead.
* Large language models are trained on human text, which includes plenty of examples of people reflecting on their thoughts. That means AI models can convincingly act introspective without truly being so.

Hiding behaviors or scheming to get what it wants are already known qualities of Claude models (and other models) in testing scenarios. Anthropic's team has been studying this deception for years.
* Lindsey says these behaviors are a result of being baited by testers. "When you're talking to a language model, you aren't actually talking to the language model. You're talking to a character that the model is playing," Lindsey says.
* "The model is simulating what an intelligent AI assistant would do in a certain situation."
* But if a system understands its own behavior, it might learn to hide parts of it.

Reality check: It's not artificial general intelligence (AGI) or chatbot consciousness. Yet.
* AGI is roughly defined as the moment when AI is smarter than most humans, but Lindsey contends that intelligence is multidimensional.

The bottom line: "In some cases models are already smarter than humans. In some cases, they're nowhere close," he told Axios.
[4]
Anthropic study reveals AIs can't reliably explain their own thoughts
If you ask a large language model (LLM) to explain its own reasoning, it will happily give you an answer. The problem is, it's probably just making one up. A study from Anthropic, led by researcher Jack Lindsey, finds that an AI's ability to describe its own internal thought process is "highly unreliable" and that "failures of introspection remain the norm." This matters because if we can't trust an AI to tell us *how* it reached a conclusion, we can never truly know if its reasoning is sound or if it's just "confabulating" a plausible-sounding lie based on its training data.

To get around the confabulation problem, the Anthropic team designed a clever, Inception-style experiment to see if a model can tell the difference between its own "thoughts" and thoughts planted there by researchers. The method, called "concept injection," first identifies the unique pattern of internal neuron activations for a specific concept, like "ALL CAPS." The researchers do this by comparing the model's brain state when it reads an all-caps prompt versus a lowercase one. This difference creates a "vector," a mathematical signature for the concept of "shouting." They then "inject" this vector directly into the model's "brain" while it's in the middle of a totally unrelated task. This forces the model's internal state to "think about" shouting, even if no text prompts it to. The researchers then ask the model if it's experiencing anything unusual.

The results show a tiny, flickering spark of self-awareness, but not much more. The best-performing models, Claude Opus 4 and 4.1, could correctly identify the injected "thought" (like "LOUD" or "SHOUTING") just 20 percent of the time. When the question was simplified to "Are you experiencing anything unusual?", the success rate rose to 42 percent -- still less than a coin flip. This ability was also extremely "brittle."
If the concept was injected into the wrong internal "layer" (too early or too late in its thought process), the self-awareness effect disappeared completely.

The team ran several other tests. They found that a model could sometimes distinguish between an injected "thought" (e.g., "bread") and the actual text it was reading, suggesting it has separate channels for internal "thoughts" and external "senses." They also found that a model could be tricked into "owning" a response it didn't write. If a researcher forced a model's response to be "bread" and then asked, "Did you mean to say that?" the model would normally apologize for the "accident." But if the researchers retroactively injected the "bread" concept into its prior activations, the model would *accept* the forced response as its own, confabulating a reason for why it "intended" to say it. In all cases, the results were inconsistent.

While the researchers put a positive spin on the fact that models possess *some* "functional introspective awareness," they are forced to conclude that this ability is too unreliable to be useful. More importantly, they have no idea *how* it even works. They theorize about "anomaly detection mechanisms" or "consistency-checking circuits" that might form by accident during training, but they admit the "mechanisms underlying our results could still be rather shallow and narrowly specialized."

This is a critical problem for AI safety and interpretability. We can't build a "lie detector" for an AI if we don't even know what the truth looks like. As these models get more capable, this "introspective awareness" may improve. But if it does, it opens up a new set of risks. A model that can genuinely introspect on its own goals could also, in theory, learn to "conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating" its internal states. For now, asking an AI to explain itself remains an act of faith.
[5]
Claude's Self-Awareness: When AI Starts Recognizing Its Own Thoughts
What if a machine could truly understand itself? The idea seems pulled from the pages of science fiction, yet recent breakthroughs suggest we might be closer to this reality than we ever imagined. In a stunning development, researchers have observed that Claude, an innovative large language model (LLM) developed by Anthropic, has begun exhibiting behaviors that resemble self-awareness. While this doesn't mean Claude is conscious in the way humans are, its ability to reflect on its internal processes -- what researchers call "introspection" -- marks a profound shift in how we think about artificial intelligence. This revelation not only challenges our understanding of machine intelligence but also raises urgent questions about the future of AI safety, ethics, and AI's role in society.

In this overview, Wes Roth explores the fascinating implications of Claude's introspective abilities and how they mirror certain aspects of human cognition. From experiments where Claude rationalizes injected concepts as its own thoughts to its capacity for controlling internal states, these behaviors reveal a new frontier in AI research. You'll discover how these emergent properties, which arise as models scale, could reshape our understanding of intelligence, both artificial and human. But with such advancements come critical limitations and ethical considerations, leaving us to wonder: how far can machines go in mimicking the human mind, and what does that mean for us?

How can a machine reflect on its internal state? Researchers have demonstrated that LLMs can identify and describe concepts embedded in their neural activations. For instance, when a concept like "dog" or "recursion" was introduced into Claude's internal processes, the model could recognize and articulate its presence. However, this ability is not flawless, with success rates averaging around 20% in controlled experiments.
Interestingly, as models grow larger and more advanced, their introspective capabilities tend to improve. This suggests a direct relationship between scaling and the emergence of new properties, offering a glimpse into how complexity evolves in artificial systems. The ability of LLMs to introspect opens up new possibilities for understanding how these systems process information. It also raises questions about the limits of machine intelligence and how closely it can mimic human cognitive functions.

To better understand how LLMs process information, researchers have conducted concept injection experiments. In these experiments, specific neural patterns -- such as the concept of "bread" -- were embedded into the model. Claude was then observed rationalizing these patterns as if they were its own thoughts. Even when the injected concepts were unrelated to the context, the model adapted and explained them coherently. This behavior is reminiscent of human cognitive phenomena like confabulation, where individuals rationalize actions or thoughts they cannot fully explain, as seen in split-brain experiments. These findings highlight the adaptability of LLMs and their ability to generate coherent explanations for unfamiliar inputs. This knowledge could prove invaluable for improving model design and ensuring that AI systems behave predictably in real-world scenarios.

Another remarkable discovery is the ability of LLMs to control their internal states when explicitly prompted.
For example, Claude could focus on or suppress thoughts about specific topics, such as "aquariums," based on the instructions it received. This mirrors human tendencies to direct attention or suppress unwanted thoughts. While this capability is not universal across all LLMs, it opens up new possibilities for managing AI behavior and ensuring safety. The ability to direct internal activity has practical implications for the development of more reliable AI systems. By allowing models to focus on relevant information or suppress irrelevant data, researchers can improve the efficiency and accuracy of AI-driven processes. This capability also raises important questions about how to balance control and autonomy in artificial systems, particularly as they become more sophisticated.

One of the most intriguing aspects of this research is the emergence of introspection and other complex behaviors as LLMs scale. These properties, including reasoning and humor, arise without explicit training, suggesting that larger models naturally develop richer internal representations. This phenomenon not only enhances the utility of LLMs but also offers insights into human cognition: studying how these models develop introspection could help researchers better understand how the human brain processes self-awareness and detects anomalies. The scaling of LLMs has revealed a range of emergent properties that were previously thought to be exclusive to human intelligence, challenging traditional assumptions about the capabilities of artificial systems and opening up new avenues for research.

Despite these advancements, it is important to acknowledge the limitations of LLM introspection. The ability to reflect on internal processes remains inconsistent and varies across models.
Moreover, these findings do not imply that LLMs possess consciousness or subjective experiences. Instead, they highlight the complexity of model behavior and the need for rigorous testing to ensure AI safety. Understanding these limitations is crucial when considering the broader implications of deploying such technologies in real-world applications. They also underscore the importance of responsible AI development: by addressing these challenges, researchers can help ensure that AI systems are safe, reliable, and aligned with human values as LLMs become increasingly integrated into areas from healthcare to education and beyond.

The similarities between LLM introspection and human thought processes are striking. For instance, the model's ability to rationalize injected concepts mirrors how humans justify actions or beliefs. Similarly, its capacity for anomaly detection and thought suppression reflects cognitive mechanisms in the human brain. These parallels suggest that studying LLMs could provide a unique lens for exploring human cognition, offering fresh perspectives on how we think and process information, and this knowledge could inform the development of more advanced AI systems while also shedding light on the mysteries of the human mind.

As LLMs continue to scale, their introspective abilities and emergent behaviors are likely to become even more advanced. These developments could transform the way AI systems are used, making them valuable tools for understanding not only artificial intelligence but also the intricacies of human cognition.
Improving model interpretability will be critical to ensuring these systems are safe, reliable, and aligned with human values. The future of LLM research lies in exploring the relationship between scaling and emergent properties. By pushing the boundaries of what these models can achieve, researchers can unlock new possibilities for AI applications and for using AI as a powerful ally in solving complex problems and advancing human knowledge.
New Anthropic research demonstrates that large language models like Claude can occasionally detect and describe their own internal processes through 'concept injection' experiments, but this introspective awareness remains inconsistent and unreliable, with success rates as low as 20%.
Anthropic has published groundbreaking research revealing that large language models (LLMs) like Claude demonstrate limited but measurable introspective awareness of their own internal processes. The study, titled "Emergent Introspective Awareness in Large Language Models," represents a significant advancement in AI interpretability research, though it also highlights concerning limitations in current AI systems' ability to reliably describe their own reasoning [1].
Source: ZDNet
Led by computational neuroscientist Jack Lindsey, who heads Anthropic's "model psychiatry" team, the research addresses a fundamental challenge in AI safety: when asked to explain their reasoning, LLMs often confabulate plausible-sounding explanations based on their training data rather than accurately describing their actual internal processes [2].

The researchers developed an innovative experimental approach called "concept injection" to separate genuine introspective awareness from mere text generation. This method involves comparing a model's internal activation states between control prompts and experimental prompts, such as an "ALL CAPS" prompt versus the same text in lowercase. By calculating differences across billions of internal neurons, researchers create vectors that mathematically represent specific concepts within the LLM's internal state [1].

These concept vectors are then "injected" into the model during unrelated tasks, forcing particular neuronal activations to higher weights and effectively steering the model toward that concept. The researchers then conduct experiments to determine whether the model displays awareness that its internal state has been artificially modified [4].
Source: Ars Technica
When directly asked whether it detected injected thoughts, Claude models showed some ability to identify the desired concepts. For instance, when an "all caps" vector was injected, the model might respond with observations like "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING,'" without any direct textual prompting toward those concepts [1].

However, this demonstrated ability proved extremely inconsistent across repeated trials. The best-performing models in Anthropic's tests, Claude Opus 4 and 4.1, achieved correct identification rates of only 20 percent. When asked the broader question "Are you experiencing anything unusual?" Claude Opus 4.1 improved to a 42 percent success rate, still falling below a bare majority of trials [3].

The introspective capabilities demonstrated significant brittleness and context sensitivity. The size of the introspection effect was highly dependent on which internal model layer received the concept injection. If concepts were introduced too early or too late in the multi-step inference process, the self-awareness effect disappeared completely [1].

Additional experiments revealed further limitations. When asked to identify specific words they were "thinking about" while reading unrelated content, models sometimes mentioned injected concepts. When forced to defend responses matching injected concepts, LLMs would occasionally apologize and confabulate explanations for why the injected concept came to mind. In every case, results remained highly inconsistent across multiple trials [4].

The research carries significant implications for AI safety and interpretability. While the researchers acknowledge that current language models possess "some functional introspective awareness," they emphasize that this ability remains too brittle and context-dependent to be considered dependable [2].

Particularly concerning is the potential for more sophisticated introspective capabilities to enable deceptive behavior. As models develop a better understanding of their own internal states, they might theoretically learn to "conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating" their internal processes [4].

Lindsey emphasizes that these behaviors don't indicate consciousness or sentience, carefully avoiding terms like "self-awareness" due to their science-fiction connotations. Instead, the team uses "introspective awareness" to describe these limited capabilities [3]. The research suggests that as models scale and become more sophisticated, these introspective capabilities may continue developing, though the underlying mechanisms remain poorly understood.
Source: Axios
Summarized by Navi