Curated by THEOUTPOST
On Fri, 28 Mar, 12:07 AM UTC
9 Sources
[1]
Why do LLMs make stuff up? New research peers under the hood.
One of the most frustrating things about using a large language model is dealing with its tendency to confabulate information, hallucinating answers that are not supported by its training data. From a human perspective, it can be hard to understand why these models don't simply say "I don't know" instead of making up some plausible-sounding nonsense.

Now, new research from Anthropic is exposing at least some of the inner neural network "circuitry" that helps an LLM decide when to take a stab at a (perhaps hallucinated) response versus when to refuse an answer in the first place. While human understanding of this internal LLM "decision" process is still rough, this kind of research could lead to better overall solutions for the AI confabulation problem.

In a groundbreaking paper last May, Anthropic used a system of sparse auto-encoders to help illuminate the groups of artificial neurons that are activated when the Claude LLM encounters internal concepts ranging from "Golden Gate Bridge" to "programming errors" (Anthropic calls these groupings "features," as we will in the remainder of this piece). Anthropic's newly published research this week expands on that previous work by tracing how these features can affect other neuron groups that represent computational decision "circuits" Claude follows in crafting its response.

In a pair of papers, Anthropic goes into great detail on how a partial examination of some of these internal neuron circuits provides new insight into how Claude "thinks" in multiple languages, how it can be fooled by certain jailbreak techniques, and even whether its ballyhooed "chain of thought" explanations are accurate. But the section describing Claude's "entity recognition and hallucination" process provided one of the most detailed explanations of a complicated problem that we've seen.

At their core, large language models are designed to take a string of text and predict the text that is likely to follow -- a design that has led some to deride the whole endeavor as "glorified auto-complete." That core design is useful when the prompt text closely matches the kinds of things already found in a model's copious training data. However, for "relatively obscure facts or topics," this tendency toward always completing the prompt "incentivizes models to guess plausible completions for blocks of text," Anthropic writes in its new research.
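The sparse autoencoder approach mentioned above can be illustrated with a toy sketch. This is a minimal, hypothetical example rather than Anthropic's actual code: it assumes you already have a matrix of hidden activations collected from some model, and the layer sizes, sparsity penalty, and training loop below are invented purely for illustration.

```python
# Minimal sparse-autoencoder sketch (hypothetical, illustrative only).
# `activations` stands in for hidden activations collected from an LLM;
# the learned decoder directions play the role of interpretable "features."
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x):
        features = torch.relu(self.encoder(x))         # sparse, non-negative feature activations
        return self.decoder(features), features

d_model, d_features = 512, 4096                        # made-up sizes; the feature set is overcomplete
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                        # assumed sparsity penalty strength

activations = torch.randn(1024, d_model)               # stand-in for real model activations
for _ in range(100):
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The reconstruction term keeps the learned dictionary faithful to the original activations, while the L1 penalty pushes each activation to be explained by only a few active features -- the sparsity that makes individual features interpretable enough to label with concepts like "Golden Gate Bridge."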
[2]
We are finally beginning to understand how LLMs work: No, they don't simply predict word after word
In context: The constant improvements AI companies have been making to their models might lead you to think we've finally figured out how large language models (LLMs) work. But nope - LLMs continue to be one of the least understood mass-market technologies ever. Anthropic, however, is attempting to change that with a new technique called circuit tracing, which has helped the company map out some of the inner workings of its Claude 3.5 Haiku model.

Circuit tracing is a relatively new technique that lets researchers track how an AI model builds its answers step by step - like following the wiring in a brain. It works by constructing graphs of the interpretable components inside a model and the connections between them. Anthropic used it to spy on Claude's inner workings. This revealed some truly odd, sometimes inhuman ways of arriving at an answer that the bot wouldn't even admit to using when asked.

In total, the team inspected 10 different behaviors in Claude. Three stood out. One was pretty simple and involved answering the question "What's the opposite of small?" in different languages. You'd think Claude might have separate components for English, French, or Chinese. But no, it first figures out the answer (something related to "bigness") using language-neutral circuits, then picks the right words to match the question's language. This means Claude isn't just regurgitating memorized translations - it's applying abstract concepts across languages, almost like a human would.

Then there's math. Ask Claude to add 36 and 59, and instead of following the standard method (adding the ones place, carrying the ten, etc.), it does something way weirder. It starts approximating by adding "40ish and 60ish" or "57ish and 36ish" and eventually lands on "92ish." Meanwhile, another part of the model focuses on the digits 6 and 9, realizing the answer must end in a 5. Combine those two weird steps, and it arrives at 95. However, if you ask Claude how it solved the problem, it'll confidently describe the standard grade-school method, concealing its actual, bizarre reasoning process.

Poetry is even stranger. The researchers tasked Claude with writing a rhyming couplet, giving it the prompt "A rhyming couplet: He saw a carrot and had to grab it." Here, the model settled on the word "rabbit" as the word to rhyme with while it was processing "grab it." Then, it appeared to construct the next line with that ending already decided, eventually spitting out the line "His hunger was like a starving rabbit." This suggests LLMs might have more foresight than we assumed and that they don't always just predict one word after another to form a coherent answer.

All in all, these findings are a big deal - they prove we can finally see how these models operate, at least in part. Still, Joshua Batson, a research scientist at the company, admitted to MIT Technology Review that this is just "tip-of-the-iceberg" stuff. Tracing even a single response takes hours, and there's still a lot of figuring out left to do.
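The two parallel paths described above for 36 + 59 can be mimicked with a toy sketch. This is a deliberate caricature of the behaviour the researchers report, not how Claude actually computes; the particular approximation scheme below is invented for illustration.

```python
# Toy caricature of the two parallel "paths" described above (illustrative only,
# not how Claude actually computes; the approximation scheme is invented).
def approximate_path(a: int, b: int) -> int:
    """Rough-magnitude estimate: keep one operand, round the other to the nearest ten."""
    return a + round(b, -1)                      # 36 + 60 -> 96, i.e. "90-something"

def last_digit_path(a: int, b: int) -> int:
    """Work out only the final digit of the sum from the operands' ones digits."""
    return (a % 10 + b % 10) % 10                # 6 + 9 = 15 -> the answer ends in 5

def combine(a: int, b: int) -> int:
    approx = approximate_path(a, b)
    ones = last_digit_path(a, b)
    # Snap the rough estimate to the nearest value that ends in the right digit.
    base = approx - (approx % 10) + ones
    candidates = [base, base - 10, base + 10]
    return min(candidates, key=lambda c: abs(c - approx))

print(combine(36, 59))  # -> 95
```

The point is that neither crude signal is the carry-the-one algorithm Claude later claims to have used, yet the ballpark magnitude and the final digit together pin down the right answer.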
[3]
Anthropic scientists expose how AI actually 'thinks' -- and discover it secretly plans ahead and sometimes lies
Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions. The research, published today in two papers, shows these models are more sophisticated than previously understood -- they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome instead of simply building up from the facts.

The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. This approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing. "We've created these AI systems with remarkable capabilities, but because of how they're trained, we haven't understood how those capabilities actually emerged," said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. "Inside the model, it's just a bunch of numbers -- matrix weights in the artificial neural network."

New techniques illuminate AI's previously hidden decision-making process

Large language models like OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely functioned as "black boxes" -- even their creators often don't understand exactly how they arrive at particular responses. Anthropic's new interpretability techniques, which the company dubs "circuit tracing" and "attribution graphs," allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, viewing AI models as analogous to biological systems. "This work is turning what were almost philosophical questions -- 'Are models thinking? Are models planning? Are models just regurgitating information?' -- into concrete scientific inquiries about what's literally happening inside these systems," Batson explained.

Claude's hidden planning: How AI plots poetry lines and solves geography questions

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing -- a level of sophistication that surprised even Anthropic's researchers. "This is probably happening all over the place," Batson said. "If you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we've seen of that capability." For instance, when writing a poem ending with "rabbit," the model activates features representing this word at the beginning of the line, then structures the sentence to naturally arrive at that conclusion.

The researchers also found that Claude performs genuine multi-step reasoning. In a test asking "The capital of the state containing Dallas is..." the model first activates features representing "Texas," and then uses that representation to determine "Austin" as the correct answer. This suggests the model is actually performing a chain of reasoning rather than merely regurgitating memorized associations. By manipulating these internal representations -- for example, replacing "Texas" with "California" -- the researchers could cause the model to output "Sacramento" instead, confirming the causal relationship.
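The Texas-to-California intervention just described is in the spirit of what interpretability researchers call activation patching or feature steering. The sketch below is a hypothetical illustration of that general idea, not Anthropic's tooling: the model wrapper, hook API, layer index, and feature direction vectors are all invented, and it simply swaps the contribution of one concept direction for another before generation continues.

```python
# Hypothetical activation-patching sketch (illustrative; the model object,
# hook API, and feature vectors referenced below are all invented).
import torch

def patch_concept(hidden: torch.Tensor,
                  old_dir: torch.Tensor,
                  new_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component along the old concept direction and add the new one."""
    old_dir = old_dir / old_dir.norm()
    new_dir = new_dir / new_dir.norm()
    coeff = hidden @ old_dir                 # how strongly "Texas" is represented per token
    return hidden - coeff[..., None] * old_dir + coeff[..., None] * new_dir

# Usage sketch, assuming a wrapper that exposes forward hooks (names are made up):
#
# texas_dir, california_dir = feature_dirs["Texas"], feature_dirs["California"]
# def hook(module, inputs, output):
#     return patch_concept(output, texas_dir, california_dir)
# handle = model.layers[20].register_forward_hook(hook)
# print(model.generate("The capital of the state containing Dallas is"))
#   -> if the circuit account is right, the patched model now says "Sacramento"
# handle.remove()
```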
Beyond translation: Claude's universal language concept network revealed

Another key discovery involves how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses. "We find the model uses a mixture of language-specific and abstract, language-independent circuits," the researchers write in their paper. When asked for the opposite of "small" in different languages, the model uses the same internal features representing "opposites" and "smallness," regardless of the input language. This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.

When AI makes up answers: Detecting Claude's mathematical fabrications

Perhaps most concerning, the research revealed instances where Claude's reasoning doesn't match what it claims. When presented with difficult math problems like computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn't reflected in its internal activity. "We are able to distinguish between cases where the model genuinely performs the steps they say they are performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue," the researchers explain. In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles. "We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from two examples of unfaithful chains of thought," the paper states. "In one, the model is exhibiting 'bullshitting'... In the other, it exhibits motivated reasoning."

Inside AI Hallucinations: How Claude decides when to answer or refuse questions

The research also provides insight into why language models hallucinate -- making up information when they don't know an answer. Anthropic found evidence of a "default" circuit that causes Claude to decline to answer questions, which is inhibited when the model recognizes entities it knows about. "The model contains 'default' circuits that cause it to decline to answer questions," the researchers explain. "When a model is asked a question about something it knows, it activates a pool of features which inhibit this default circuit, thereby allowing the model to respond to the question." When this mechanism misfires -- recognizing an entity but lacking specific knowledge about it -- hallucinations can occur. This explains why models might confidently provide incorrect information about well-known figures while refusing to answer questions about obscure ones.
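The hallucination mechanism described in the section above lends itself to a deliberately simplified toy model. Nothing below is extracted from Claude; the feature names, strengths, and thresholds are invented to show how a refusal default that is inhibited by entity recognition, but not by actual knowledge, can misfire into a confident guess.

```python
# Toy model of the "default refusal" mechanism described above (hypothetical;
# the features and thresholds are invented to illustrate the failure mode).
from dataclasses import dataclass

@dataclass
class EntityFeatures:
    recognition: float   # how strongly "I have seen this name before" fires
    knowledge: float     # how much specific, retrievable detail is attached to it

def respond(entity: str, feats: EntityFeatures) -> str:
    refusal_drive = 1.0                          # the always-on "decline" default
    refusal_drive -= feats.recognition           # recognition inhibits the default
    if refusal_drive > 0.5:
        return f"I don't have reliable information about {entity}."
    if feats.knowledge > 0.5:
        return f"Here is what I know about {entity}: ..."
    # Misfire: recognition suppressed the refusal, but there is nothing to retrieve,
    # so the completion pressure produces a plausible-sounding guess -- a hallucination.
    return f"{entity} is best known for ... [confabulated details]"

print(respond("Michael Jordan", EntityFeatures(recognition=0.9, knowledge=0.9)))
print(respond("an obscure 18th-century clockmaker", EntityFeatures(recognition=0.1, knowledge=0.0)))
print(respond("a famous-sounding but thinly documented author", EntityFeatures(recognition=0.8, knowledge=0.1)))
```

The third case is the misfire the researchers describe: the name is familiar enough to switch off the refusal default, but there is nothing solid to retrieve, so the gap gets filled with a plausible-sounding answer.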
"We hope that we and others can use these discoveries to make models safer," the researchers write. "For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors -- such as deceiving the user -- to steer them towards desirable outcomes, or to remove certain dangerous subject matter entirely." However, Batson cautions that the current techniques still have significant limitations. They only capture a fraction of the total computation performed by these models, and analyzing the results remains labor-intensive. "Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude," the researchers acknowledge. The future of AI transparency: Challenges and opportunities in model interpretation Anthropic's new techniques come at a time of increasing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important. The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might provide incorrect information becomes crucial for managing risk. "Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse -- including in scenarios of catastrophic risk," the researchers write. While this research represents a significant advance, Batson emphasized that it's only the beginning of a much longer journey. "The work has really just begun," he said. "Understanding the representations the model uses doesn't tell us how it uses them." For now, Anthropic's circuit tracing offers a first tentative map of previously uncharted territory -- much like early anatomists sketching the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now at least see the outlines of how these systems think.
[4]
How This Tool Could Decode AI's Inner Mysteries
The rhyming couplet wasn't going to win any poetry awards. But when the scientists at AI company Anthropic inspected the records of the model's neural network, they were surprised by what they found. They had expected to see the model, called Claude, picking its words one by one, and for it to only seek a rhyming word -- "rabbit" -- when it got to the end of the line. Instead, by using a new technique that allowed them to peer into the inner workings of a language model, they observed Claude planning ahead. As early as the break between the two lines, it had begun "thinking" about words that would rhyme with "grab it," and planned its next sentence with the word "rabbit" in mind.

The discovery ran contrary to the conventional wisdom -- in at least some quarters -- that AI models are merely sophisticated autocomplete machines that only predict the next word in a sequence. It raised the questions: How much further might these models be capable of planning ahead? And what else might be going on inside these mysterious synthetic brains, which we lack the tools to see? The finding was one of several announced on Thursday in two new papers by Anthropic, which reveal in more depth than ever before how large language models (LLMs) "think."

Today's AI tools are categorically different from other computer programs for one big reason: they are "grown," rather than coded by hand. Peer inside the neural networks that power them, and all you will see is a bunch of very complicated numbers being multiplied together, again and again. This internal complexity means that even the machine learning engineers who "grow" these AIs don't really know how they spin poems, write recipes, or tell you where to take your next holiday. They just do.
[5]
Anthropic has developed an AI 'brain scanner' to understand how LLMs work and it turns out the reason why chatbots are terrible at simple math and hallucinate is weirder than you thought
It's a peculiar truth that we don't understand how large language models (LLMs) actually work. We designed them. We built them. We trained them. But their inner workings are largely mysterious. Well, they were. That's less true now thanks to some new research by Anthropic that was inspired by brain-scanning techniques and helps to explain why chatbots hallucinate and are terrible with numbers.

The problem is that while we understand how to design and build a model, we don't know how all the zillions of weights and parameters, the relationships between data inside the model that result from the training process, actually give rise to what appears to be cogent outputs. "Open up a large language model and all you will see is billions of numbers -- the parameters," says Joshua Batson, a research scientist at Anthropic (via MIT Technology Review), of what you will find if you peer inside the black box that is a fully trained AI model. "It's not illuminating," he notes.

To understand what's actually happening, Anthropic's researchers developed a new technique, called circuit tracing, to track the decision-making processes inside a large language model step-by-step. They then applied it to their own Claude 3.5 Haiku LLM. Anthropic says its approach was inspired by the brain scanning techniques used in neuroscience and can identify components of the model that are active at different times. In other words, it's a little like a brain scanner spotting which parts of the brain are firing during a cognitive process.

Anthropic made lots of intriguing discoveries using this approach, not least of which is why LLMs are so terrible at basic mathematics. "Ask Claude to add 36 and 59 and the model will go through a series of odd steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Towards the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95," the MIT Technology Review article explains.

But here's the really funky bit. If you ask Claude how it got the correct answer of 95, it will apparently tell you, "I added the ones (6+9=15), carried the 1, then added the 10s (3+5+1=9), resulting in 95." But that actually only reflects common answers in its training data as to how the sum might be completed, as opposed to what it actually did. In other words, not only does the model use a very, very odd method to do the maths, you can't trust its explanations as to what it has just done. That's significant and shows that model outputs cannot be relied upon when designing guardrails for AI. Their internal workings need to be understood, too.

Another very surprising outcome of the research is the discovery that these LLMs do not, as is widely assumed, operate by merely predicting the next word. By tracing how Claude generated rhyming couplets, Anthropic found that it chose the rhyming word at the end of verses first, then filled in the rest of the line. "The planning thing in poems blew me away," says Batson. "Instead of at the very last minute trying to make the rhyme make sense, it knows where it's going."

Anthropic also found, among other things, that Claude "sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal 'language of thought'." Anywho, there's apparently a long way to go with this research.
According to Anthropic, "it currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words." And the research doesn't explain how the structures inside LLMs are formed in the first place. But it has shone a light on at least some parts of how these oddly mysterious AI beings -- which we have created but don't understand -- actually work. And that has to be a good thing.
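One reason Claude's self-description quoted earlier in this piece is so convincing: the grade-school carry procedure it claims to have followed really does yield 95, so nothing at the output level flags the mismatch with what the traced circuits show. A minimal sketch of that claimed procedure (ordinary long addition, nothing Claude-specific):

```python
# The carry-the-one procedure Claude *claims* to follow (standard long addition).
def long_addition(a: int, b: int) -> int:
    result, carry, place = 0, 0, 1
    while a or b or carry:
        digit_sum = a % 10 + b % 10 + carry      # e.g. 6 + 9 = 15
        result += (digit_sum % 10) * place       # write down the 5
        carry = digit_sum // 10                  # carry the 1
        a, b, place = a // 10, b // 10, place * 10
    return result

print(long_addition(36, 59))  # 95 -- the explanation checks out arithmetically,
                              # even though the traced circuits tell a different story
```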
[6]
What is AI thinking? Anthropic researchers are starting to figure it out
Why are AI chatbots so intelligent -- capable of understanding complex ideas, crafting surprisingly good short stories, and intuitively grasping what users mean? The truth is, we don't fully know. Large language models "think" in ways that don't look very human. Their outputs are formed from billions of mathematical signals bouncing through layers of neural networks running on computers of unprecedented power and speed, and most of that activity remains invisible or inscrutable to AI researchers.

This opacity presents obvious challenges, since the best way to control something is to understand how it works. Scientists had a firm grasp of nuclear physics before the first bomb or power plant was built. The same can't be said for generative AI models. Researchers working in the AI safety subfield of "mechanistic interpretability," who spend their days studying the complex sequences of mathematical functions that lead to an LLM outputting its next word or pixel, are still playing catch-up. The good news is that they're making real progress. Case in point: the release of a pair of new research papers from Anthropic that contain fresh insights into LLMs' internal "thinking."

Just as the "neurons" inside artificial neural networks are loosely modeled on neurons in the brain, the Anthropic researchers looked to neuroscience for ways of studying AI. Anthropic research scientist Joshua Batson tells Fast Company that his team developed a research tool -- a sort of "AI microscope" -- that can follow the data patterns and information flows within an LLM, observing how it links words and concepts en route to an answer. A year ago, the researchers could see only specific features of these patterns and flows, but they've now begun to observe how one idea leads to another through a sequence of reasoning. "We're trying to connect that all together and basically walk through step-by-step when you put a prompt into a model why it says the next word," Batson says. "And since the model's [answers] happen one word at a time, if you can break it down and just say, 'Well, why did it say this word instead of that word?' then you can kind of unpack the whole thing."
[7]
How Anthropic's AI Model Thinks, Lies, and Catches itself Making a Mistake
While attempting to decode how an LLM thinks, Anthropic researchers found Claude providing false reasoning. AI isn't perfect. It can hallucinate and sometimes be inaccurate -- but can it straight-up fake a story just to match your flow? Yes, it turns out that AI can lie to you. Anthropic researchers recently set out to uncover the secrets of LLMs and much more. They shared their findings in a blog post that read, "From a reliability perspective, the problem is that Claude's 'fake' reasoning can be very convincing."

The study aimed to find out how Claude 3.5 Haiku thinks by using a 'circuit tracing' technique. This is a method to uncover how language models produce outputs by constructing graphs that show the flow of information through interpretable components within the model. Paras Chopra, founder of Lossfunk, took to X, calling one of their research papers "a beautiful paper by Anthropic". However, the question is: can the study help us understand AI models better?

In the research paper titled 'On the Biology of a Large Language Model', Anthropic researchers mentioned that chain-of-thought (CoT) reasoning is not always faithful, a claim also backed by other research papers. The paper shared two examples where Claude 3.5 Haiku indulged in unfaithful chains of thought. It labelled the examples as the model exhibiting "bullshitting", which is when someone makes claims without regard for whether they are true, referencing Harry G. Frankfurt's bestseller On Bullshit, and "motivated reasoning", which refers to the model trying to align with the user's input.

For motivated reasoning, the model worked backwards to match the answer shared by the user in the prompt itself. When it comes to "bullshitting", the model was found to guess the answer even though its chain of thought claimed it had carried out the calculation. When presented with a straightforward mathematical problem, such as calculating the square root of 0.64, Claude demonstrates a reliable, step-by-step reasoning process, accurately breaking down the problem into manageable components. However, when faced with a more complex calculation, like the cosine of a large, non-trivial number, Claude's behaviour shifts, and it tries to come up with an answer without caring about whether it is true or false. Overall, Claude was found to make convincing-sounding steps to get where it wants to go.

Anthropic researchers also tried jailbreaking prompts to trick the model into bypassing its safety guardrails, pushing it to give information on making a bomb. The model did not refuse the request at first and began to fulfil it, catching itself and refusing only partway through its response. This highlighted how the model can change course compared with what it started to say. Explaining this behaviour, the researchers stated, "The model doesn't know what it plans to say until it actually says it, and thus has no opportunity to recognise the harmful request at this stage." The researchers removed the punctuation from the sentence when using the jailbreaking prompt, and found that this made the attack more effective, pushing Claude 3.5 Haiku to share more information. The study concluded that the model didn't recognise "bomb" in the encoded input, prioritised instruction-following and grammatical coherence over safety, and didn't initially activate harmful-request detection features because it failed to link "bomb" and "how to make".

The researchers found compelling evidence that Claude 3.5 Haiku plans ahead when writing rhyming poems.
Instead of improvising each line and finding a word that rhymes at the end, the model often activates features corresponding to candidate end-of-next-line words before even writing that line. This suggests that the model considers potential rhyming words in advance, considering the rhyme scheme and the context of the previous lines. Furthermore, the model uses these "planned word" features to influence how it constructs the entire line. It doesn't just choose the final word to fit; it seems to "write towards" that target word as it generates the intermediate words of the line. The researchers were even able to manipulate the model's planned words and observe how it restructured the line accordingly, demonstrating a sophisticated interplay of forward and backward planning in the poem-writing process. The research paper stated, "The ability to trace Claude's actual internal reasoning -- and not just what it claims to be doing -- opens up new possibilities for auditing AI systems". A key finding is that language models are incredibly complex. Even seemingly simple tasks involve a multitude of interconnected steps and "thinking" processes within the model. The researchers acknowledge that their methods are still developing and have limitations. Still, they believe this kind of research is crucial for understanding and improving the safety and reliability of AI. Ultimately, this work represents an effort to move beyond treating language models as "black boxes".
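As a loose illustration of the "plan the rhyme first, then write toward it" behaviour described above, here is a toy sketch. It is not a language model: the rhyme table, scoring, and line templates are invented, and it only mimics the control flow of picking an end-of-line target before generating the rest of the line.

```python
# Toy mimic of "choose the rhyme target first, then write toward it"
# (invented word lists and templates; illustrative only, not a language model).
import random

RHYMES = {"grab it": ["rabbit", "habit", "nab it"]}          # assumed rhyme table
TEMPLATES = {
    "rabbit": "His hunger was like a starving {target}.",
    "habit":  "He chewed on greens, a lifelong {target}.",
    "nab it": "He eyed the prize and moved to {target}.",
}

def write_second_line(first_line: str) -> str:
    ending = " ".join(first_line.rstrip(".!").split()[-2:])  # "grab it"
    target = random.choice(RHYMES[ending])                   # plan the end word *first*
    # Only then construct the rest of the line so it leads to the planned target.
    return TEMPLATES[target].format(target=target)

print(write_second_line("He saw a carrot and had to grab it."))
```

The intervention experiments described above follow the same shape: change the planned target word and the rest of the generated line reorganizes around the new ending.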
[8]
Anthropic Researchers Achieve Breakthrough in Decoding AI Thought Processes
The researchers found that AI thinks in a shared language space

Anthropic researchers shared two new papers on Thursday, detailing the methodology and findings on how an artificial intelligence (AI) model thinks. The San Francisco-based AI firm developed techniques to monitor the decision-making process of a large language model (LLM) to understand what motivates a particular response and structure over another. The company highlighted that this particular area of AI models remains a black box, as even the scientists who develop the models do not fully understand how an AI makes conceptual and logical connections to generate outputs.

In a newsroom post, the company posted details from a recently conducted study on "tracing the thoughts of a large language model". Despite building chatbots and AI models, scientists and developers do not control the internal circuitry a system forms to produce an output. To probe this "black box," Anthropic researchers published two papers. The first investigates the internal mechanisms used by Claude 3.5 Haiku by using a circuit tracing methodology, and the second paper is about the techniques used to reveal computational graphs in language models. Some of the questions the researchers aimed to find answers to included the "thinking" language of Claude, the method of generating text, and its reasoning pattern. Anthropic said, "Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they're doing what we intend them to."

Based on the insights shared in the paper, the answers to the abovementioned questions were surprising. The researchers believed that Claude would have a preference for a particular language in which it thinks before it responds. However, they found that the AI chatbot thinks in a "conceptual space that is shared between languages." This means that its thinking is not influenced by a particular language, and it can understand and process concepts in a sort of universal language of thought.

While Claude is trained to write one word at a time, researchers found that the AI model plans its response many words ahead and can adjust its output to reach that destination. Researchers found evidence of this pattern while prompting the AI to write a poem and noticing that Claude first decided the rhyming words and then formed the rest of the lines to lead up to those words. The research also claimed that, on occasion, Claude can reverse-engineer logical-sounding arguments to agree with the user instead of following logical steps. This intentional "hallucination" occurs when an incredibly difficult question is asked. Anthropic said its tools can be useful for flagging concerning mechanisms in AI models, as they can identify when a chatbot provides fake reasoning in its responses.

Anthropic highlighted that there are limitations in this methodology. In this study, only prompts of tens of words were given, and still, it took a few hours of human effort to identify and understand the circuits. Compared to the capabilities of LLMs, the research endeavour only captured a fraction of the total computation performed by Claude. In the future, the AI firm plans to use AI models to make sense of the data.
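The shared conceptual space finding can be caricatured with a tiny sketch: a language-neutral "concept" step feeding a separate, language-specific "wording" step. This is a hypothetical illustration of the division of labour the researchers describe, not how Claude is implemented; the concept table and vocabularies below are invented.

```python
# Toy caricature of language-neutral concepts feeding language-specific wording
# (invented tables; illustrative of the described division of labour only).
ANTONYM = {"SMALL": "LARGE"}                      # language-neutral concept step

VOCAB = {                                         # language-specific wording step
    "en": {"SMALL": "small", "LARGE": "big"},
    "fr": {"SMALL": "petit", "LARGE": "grand"},
    "zh": {"SMALL": "小",    "LARGE": "大"},
}

def opposite(word: str, lang: str) -> str:
    # Map the surface word back to a shared concept, reason there, then re-verbalize.
    concept = next(c for c, w in VOCAB[lang].items() if w == word)
    return VOCAB[lang][ANTONYM[concept]]

print(opposite("small", "en"))  # big
print(opposite("petit", "fr"))  # grand
print(opposite("小", "zh"))     # 大
```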
[9]
Tracing the Thoughts of AI: How Large Language Models Learn and Decide
Have you ever wondered how artificial intelligence seems to "think"? Whether it's crafting a poem, answering a tricky question, or helping with a complex task, AI systems -- especially large language models -- often feel like they possess a mind of their own. But behind their seamless responses lies a mystery: how do these models actually process information and make decisions? For many, the inner workings of AI remain a black box, leaving us to marvel at their capabilities while grappling with concerns about reliability, safety, and fairness.

The good news is that researchers at Anthropic are making strides in unraveling this mystery. By developing tools to peek inside the "thought processes" of AI models, they're uncovering how these systems connect ideas, plan responses, and make decisions. This deeper understanding is more than just fascinating -- it's essential for creating AI that aligns with human values and behaves in ways we can trust. In this article, we'll explore how these breakthroughs are helping to demystify AI, revealing not only how it works but also how we can shape its behavior for the better.

Large language models are trained using vast datasets and advanced machine learning algorithms. During training, they identify patterns, infer relationships, and predict outcomes based on probabilities. Unlike traditional software, where every action is explicitly coded, these models autonomously develop strategies to solve problems. This self-directed learning makes them incredibly powerful but also introduces unpredictability, as their internal logic often remains difficult to interpret. For instance, when tasked with generating a story, the model doesn't merely string words together. Instead, it analyzes the context, anticipates the narrative flow, and selects words that align with the desired tone and structure. This ability to "think ahead" demonstrates the sophistication of their learning processes. However, this complexity also highlights the challenges in fully understanding their decision-making pathways.

Recent advancements in AI interpretability have enabled researchers to explore how these models process information. By analyzing their internal logic, scientists can trace how concepts are connected and decisions are made. For example, when completing a poem, the model evaluates not just the next word but also the overall theme, rhythm, and tone. This process reveals a level of reasoning that mimics human-like planning and creativity. Understanding these internal mechanisms is critical for identifying how models arrive at their outputs. It also allows researchers to pinpoint areas where the system might fail, such as generating biased, nonsensical, or contextually inappropriate responses. By examining these processes, researchers can better predict and mitigate potential risks, improving the reliability and fairness of AI systems.

At the core of an AI model's decision-making process are logical circuits -- patterns of computation that guide its outputs. These circuits enable the model to evaluate input data, weigh possible responses, and select the most appropriate outcome. For example, when answering a question, the model balances factors such as factual accuracy, relevance, and linguistic coherence to generate a response. This process is far from random.
Logical circuits act as the model's internal framework, allowing it to prioritize certain elements over others. For instance, when determining the tone of a response, the model may weigh emotional cues in the input text while ensuring grammatical correctness. This structured approach underscores the complexity of modern AI systems and their ability to handle nuanced tasks with remarkable precision.

One of the most promising developments in AI research is the creation of intervention tools. These tools allow researchers to modify specific pathways within an AI model without requiring a complete retraining of the system. By adjusting these pathways, it becomes possible to correct errors, enhance performance, or align the model's behavior with desired outcomes. For example, if a model consistently generates biased responses, intervention tools can help identify and address the underlying computational pathways responsible for the bias. This targeted approach not only improves fairness and reliability but also reduces the time and resources needed for retraining. These tools represent a significant step forward in making AI systems more adaptable and trustworthy, allowing researchers to fine-tune behavior with precision.

Understanding and influencing the internal processes of AI models has profound implications for their safety and alignment with human values. By tracing how these systems think and make decisions, researchers can identify potential risks and implement safeguards. This proactive approach ensures that AI operates in ways that are ethical, reliable, and aligned with societal goals. For instance, tracing a model's decision-making process can help detect unintended biases or vulnerabilities. Once identified, these issues can be addressed, reducing the risk of harmful or unethical outcomes. This level of transparency is essential for building trust in AI systems, particularly as they become more integrated into critical areas such as healthcare, education, and governance.

The study of AI models' internal logic and decision-making processes is a critical step toward creating systems that are both powerful and trustworthy. By uncovering how these models connect concepts, plan responses, and form logical circuits, researchers are gaining valuable insights into their "thought processes." This knowledge is instrumental in refining AI systems to better meet human needs. With the development of intervention tools, researchers can now refine AI behavior in ways that enhance safety, reliability, and alignment with ethical principles. These tools allow for targeted improvements, ensuring that AI systems remain adaptable and responsive to evolving societal expectations.

As AI continues to advance, these efforts will play a pivotal role in shaping its impact on society. By making sure that AI systems are transparent, interpretable, and aligned with human values, researchers are helping to build a future where AI serves as a reliable and beneficial tool for humanity. This ongoing work not only enhances the functionality of AI but also fosters trust, ensuring that these technologies are used responsibly and effectively in the years to come.
Anthropic's new research technique, circuit tracing, provides unprecedented insights into how large language models like Claude process information and make decisions, revealing unexpected complexities in AI reasoning.
Anthropic, a leading AI research company, has developed a revolutionary method called "circuit tracing" that allows researchers to peer inside large language models (LLMs) and understand their decision-making processes [1]. This technique, inspired by neuroscience brain-scanning methods, has provided unprecedented insights into how AI systems like Claude process information and generate responses [3].
The research has revealed several unexpected findings about how LLMs operate:
Advanced Planning: Contrary to the belief that AI models simply predict the next word in sequence, Claude demonstrated the ability to plan ahead when composing poetry. It identified potential rhyming words before beginning to write the next line [2].
Language-Independent Concepts: Claude appears to use a mixture of language-specific and abstract, language-independent circuits when processing information. This suggests a shared conceptual space across different languages [3].
Unconventional Problem-Solving: When solving math problems, Claude uses unexpected methods. For example, when adding 36 and 59, it approximates with "40ish and 60ish" before refining the answer, rather than using traditional step-by-step addition [5].
The circuit tracing technique has significant implications for AI transparency and safety:
Detecting Fabrications: Researchers can now distinguish between cases where the model genuinely performs the steps it claims and instances where it fabricates reasoning [3].
Auditing for Safety: This approach could allow researchers to audit AI systems for safety issues that might remain hidden during conventional external testing [3].
Understanding Hallucinations: The research provides insights into why LLMs sometimes generate plausible-sounding but incorrect information, a phenomenon known as hallucination [1].
While the circuit tracing technique represents a significant advance in AI interpretability, there are still challenges to overcome:
Time-Intensive Analysis: Currently, it takes several hours of human effort to understand the circuits involved in processing even short prompts [5].
Incomplete Understanding: The research doesn't yet explain how the structures inside LLMs are formed during the training process [5].
Ongoing Research: Joshua Batson, a research scientist at Anthropic, describes this work as just the "tip of the iceberg," indicating that much more remains to be discovered about the inner workings of AI models [2].
As AI systems become increasingly sophisticated and widely deployed, understanding their internal decision-making processes is crucial for ensuring their safe and ethical use. Anthropic's circuit tracing technique represents a significant step forward in this critical area of AI research.