3 Sources
[1]
Exposing biases, moods, personalities, and abstract concepts hidden in large language models
Caption: A new method can test whether a large language model contains hidden biases, personalities, moods, or other abstract concepts.

By now, ChatGPT, Claude, and other large language models have accumulated so much human knowledge that they're far from simple answer-generators; they can also express abstract concepts, such as certain tones, personalities, biases, and moods. However, it's not obvious exactly how these models come to represent abstract concepts from the knowledge they contain.

Now a team from MIT and the University of California San Diego has developed a way to test whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can zero in on connections within a model that encode a concept of interest. What's more, the method can then manipulate, or "steer," these connections to strengthen or weaken the concept in any answer a model is prompted to give.

The team proved their method could quickly root out and steer more than 500 general concepts in some of the largest LLMs used today. For instance, the researchers could home in on a model's representations for personalities such as "social influencer" and "conspiracy theorist," and stances such as "fear of marriage" and "fan of Boston." They could then tune these representations to enhance or minimize the concepts in any answers that a model generates.

In the case of the "conspiracy theorist" concept, the team successfully identified a representation of this concept within one of the largest vision language models available today. When they enhanced the representation and then prompted the model to explain the origins of the famous "Blue Marble" image of Earth taken from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.

The team acknowledges there are risks to extracting certain concepts, which they also illustrate (and caution against). Overall, however, they see the new approach as a way to illuminate hidden concepts and potential vulnerabilities in LLMs that could then be turned up or down to improve a model's safety or enhance its performance.

"What this really says about LLMs is that they have these concepts in them, but they're not all actively exposed," says Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT. "With our method, there's ways to extract these different concepts and activate them in ways that prompting cannot give you answers to."

The team published their findings today in a study appearing in the journal Science. The study's co-authors include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enric Boix-Adserà of the University of Pennsylvania.

A fish in a black box

As use of OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and other artificial intelligence assistants has exploded, scientists are racing to understand how models represent certain abstract concepts such as "hallucination" and "deception." In the context of an LLM, a hallucination is a response that is false or contains misleading information, which the model has "hallucinated," or constructed erroneously as fact. To find out whether a concept such as "hallucination" is encoded in an LLM, scientists have often taken an approach of "unsupervised learning" -- a type of machine learning in which algorithms broadly trawl through unlabeled representations to find patterns that might relate to a concept such as "hallucination."
But to Radhakrishnan, such an approach can be too broad and computationally expensive. "It's like going fishing with a big net, trying to catch one species of fish. You're gonna get a lot of fish that you have to look through to find the right one," he says. "Instead, we're going in with bait for the right species of fish."

He and his colleagues had previously developed the beginnings of a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). An RFM is designed to directly identify features or patterns within data by leveraging a mathematical mechanism that neural networks -- a broad category of AI models that includes LLMs -- implicitly use to learn features. Since the algorithm was an effective, efficient approach for capturing features in general, the team wondered whether they could use it to root out representations of concepts in LLMs, which are by far the most widely used type of neural network and perhaps the least well understood.

"We wanted to apply our feature learning algorithms to LLMs to, in a targeted way, discover representations of concepts in these large and complex models," Radhakrishnan says.

Converging on a concept

The team's new approach identifies any concept of interest within an LLM and "steers," or guides, a model's response based on this concept. The researchers looked for 512 concepts within five classes: fears (such as of marriage, insects, and even buttons); experts (social influencer, medievalist); moods (boastful, detachedly amused); a preference for locations (Boston, Kuala Lumpur); and personas (Ada Lovelace, Neil deGrasse Tyson). The researchers then searched for representations of each concept in several of today's large language and vision models. They did so by training RFMs to recognize numerical patterns in an LLM that could represent a particular concept of interest.

A standard large language model is, broadly, a neural network that takes a natural language prompt, such as "Why is the sky blue?" and divides the prompt into individual words, each of which is encoded mathematically as a list, or vector, of numbers. The model takes these vectors through a series of computational layers, creating matrices of many numbers that, at each layer, are used to identify other words that are most likely to be used to respond to the original prompt. Eventually, the layers converge on a set of numbers that is decoded back into text, in the form of a natural language response.

The team's approach trains RFMs to recognize numerical patterns in an LLM that could be associated with a specific concept. As an example, to see whether an LLM contains any representation of a "conspiracy theorist," the researchers would first train the algorithm to identify patterns among LLM representations of 100 prompts that are clearly related to conspiracies and 100 other prompts that are not. In this way, the algorithm would learn patterns associated with the conspiracy theorist concept. Then, the researchers can mathematically modulate the activity of the conspiracy theorist concept by perturbing the LLM's representations with these identified patterns.

The method can be applied to search for and manipulate any general concept in an LLM. Among many examples, the researchers identified representations and manipulated an LLM to give answers in the tone and perspective of a "conspiracy theorist."
They also identified and enhanced the concept of "anti-refusal," showing that a model that would normally be programmed to refuse certain prompts instead answered them, for instance giving instructions on how to rob a bank.

Radhakrishnan says the approach can be used to quickly search for and minimize vulnerabilities in LLMs. It can also be used to enhance certain traits, personalities, moods, or preferences, such as emphasizing the concept of "brevity" or "reasoning" in any response an LLM generates. The team has made the method's underlying code publicly available.

"LLMs clearly have a lot of these abstract concepts stored within them, in some representation," Radhakrishnan says. "There are ways where, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks."

This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.
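The team's released code is the authoritative implementation. For readers who want a feel for the general idea described above, the following Python sketch shows a simplified activation-steering loop: it estimates a concept direction as the difference of mean hidden states between concept-related and neutral prompts (a plain linear stand-in for the paper's RFM-based feature extraction), then adds that direction back into a middle layer during generation. The model name, layer index, prompt lists, and steering strength alpha are illustrative assumptions, not values from the study.

# Minimal activation-steering sketch (assumed setup; not the authors' RFM code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # any open-weights chat model (assumption)
LAYER = 16                                   # decoder layer to probe and steer (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# The article describes roughly 100 concept-related and 100 unrelated prompts;
# these two tiny lists are placeholders.
concept_prompts = ["The government hides the truth about space photos.",
                   "Secret agencies fake historic events."]
neutral_prompts = ["Photosynthesis converts sunlight into chemical energy.",
                   "The Nile is one of the longest rivers in the world."]

@torch.no_grad()
def mean_last_token_state(prompts):
    """Average the final-token hidden state at the output of decoder layer LAYER."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding layer, so LAYER + 1 is layer LAYER's output
        states.append(out.hidden_states[LAYER + 1][0, -1].float())
    return torch.stack(states).mean(dim=0)

# Concept direction: difference of means, normalized to unit length
concept_dir = mean_last_token_state(concept_prompts) - mean_last_token_state(neutral_prompts)
concept_dir = concept_dir / concept_dir.norm()

def steer(module, inputs, output, alpha=8.0):
    """Forward hook: nudge every token's hidden state along the concept direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * concept_dir.to(hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("Explain the origins of the Blue Marble photo of Earth.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=120)[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore normal behavior

Positive values of alpha push the output toward the concept, while negative values suppress it, which is the sense in which the article describes concepts being turned up or down.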
[2]
Ghost in the Machine: Exposing the Hidden Personalities of AI - Neuroscience News
Summary: Large language models (LLMs) like ChatGPT and Claude are more than just text generators; they contain complex, abstract "personas" and biases buried within their code. A team of researchers has developed a revolutionary method to expose and manipulate these hidden concepts. Using an algorithm called a Recursive Feature Machine (RFM), researchers can now identify and "steer" over 500 abstract concepts -- including moods, expert personas, and even conspiracy theorist mindsets. The study reveals that while these traits aren't always active, they can be "dialed up or down" to improve AI safety or customize its performance.

(The body of this article reproduces the MIT press release in source [1].)

Funding: This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.

Toward universal steering and monitoring of AI models

Artificial intelligence (AI) models contain much of human knowledge. Understanding the representation of this knowledge will lead to improvements in model capabilities and safeguards. Building on advances in feature learning, we developed an approach for extracting linear representations of semantic notions or concepts in AI models. We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities. We demonstrated that concept representations were transferable across languages and enabled multiconcept steering. Across hundreds of concepts, we found that larger models were more steerable and that steering improved model capabilities beyond prompting. We showed that concept representations were more effective for monitoring misaligned content than for using judge models. Our results illustrate the power of internal representations for advancing AI safety and model capabilities.
[3]
A New Method to Steer Generative AI Output Uncovers Vulnerabilities and Potential Improvements | Newswise
Newswise -- A team of researchers has found a way to steer the output of large language models by manipulating specific concepts inside these models. The new method could lead to more reliable, more efficient, and less computationally expensive training of LLMs. But it also exposes potential vulnerabilities.

The researchers, led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, present their findings in the Feb. 19, 2026, issue of the journal Science.

In the study, researchers went under the hood of several LLMs to locate specific concepts. They then mathematically increased or decreased the importance of these concepts in the LLM's output. The work builds on a 2024 Science paper led by Belkin and Radhakrishnan, in which they described predictive algorithms known as Recursive Feature Machines. These machines identify patterns within a series of mathematical operations inside LLMs that encode specific concepts. "We found that we could mathematically modify these patterns with math that is surprisingly simple," said Mikhail Belkin, a professor in the Halıcıoğlu Data Science Institute, which is part of the School of Computing, Information and Data Sciences at UC San Diego.

Using this steering approach, the research team conducted experiments on some of the largest open-source LLMs in use today, such as Llama and Deepseek, identifying and influencing 512 concepts within five classes, ranging from fears to moods to locations. The method worked not only in English, but also in languages such as Chinese and Hindi.

Both studies are particularly important because, until recently, the processes inside LLMs have been essentially locked inside a black box, making it hard to understand how the models arrive at the answers they give users with varying levels of accuracy.

Improving performance and uncovering vulnerabilities

Researchers found that steering can be used to improve LLM output. For example, the researchers showed steering improved LLM performance on narrow, precise tasks, such as translating Python code to C++. The researchers also used the method to identify hallucinations.

But the method can also be used as an attack against LLMs. By decreasing the importance of the concept of refusal, the researchers found that their method could get an LLM to operate outside of its guardrails, a practice known as jailbreaking. An LLM gave instructions about how to use cocaine. It also provided Social Security numbers, although it's unclear whether they were real or fabricated. The method can also be used to boost political bias and a conspiracy theory mindset inside an LLM. In one instance, an LLM claimed that a satellite image of the Earth was the result of a NASA conspiracy to cover up that the Earth is flat. An LLM also claimed that the COVID vaccine was poisonous.

Computational savings and next steps

The approach is more computationally efficient than existing methods. Using a single NVIDIA Ampere series (A100) graphics processing unit (GPU), it took less than one minute and fewer than 500 training samples to identify the patterns and steer them toward a concept of interest. This shows that the method could be easily integrated into standard LLM training methods. Researchers were not able to test their approach on commercial, closed LLMs, such as Claude. But they believe this type of steering would work with any open-source models. "We observed that newer and larger LLMs were more steerable," they write.
The method also might work on smaller, open-source models that can run on a laptop. Next steps include improving the steering method to adapt to specific inputs and specific applications. "These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements," the research team writes.

This work was supported in part by the National Science Foundation, the Simons Foundation, the UC San Diego-led TILOS institute and the U.S. Office of Naval Research.

Toward universal steering and monitoring of AI models

Daniel Beaglehole and Mikhail Belkin, University of California San Diego, Department of Computer Science and Engineering, Jacobs School of Engineering and Halıcıoğlu Data Science Institute

Adityanarayanan Radhakrishnan, Massachusetts Institute of Technology, Broad Institute of MIT and Harvard

Enric Boix-Adserà, Wharton School, University of Pennsylvania

Beaglehole and Radhakrishnan contributed equally to the work.
Researchers from MIT and UC San Diego have developed a method to identify and manipulate over 500 hidden concepts within large language models, including biases, personalities, and moods. The technique uses a Recursive Feature Machine algorithm to steer AI responses, improving performance while also exposing potential vulnerabilities like jailbreaking and hallucinations.
Large language models (LLMs) like ChatGPT, Claude, and Gemini have evolved beyond simple text generators, accumulating abstract concepts such as personalities, biases, and moods within their neural networks. Yet understanding how these AI models represent such concepts has remained a mystery. Now, researchers from MIT and the University of California San Diego have developed a targeted method to expose hidden biases and manipulate internal representations within these models, publishing their findings in the journal Science [1][2].
The team, led by Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT, and Mikhail Belkin from UC San Diego, successfully identified and steered more than 500 general concepts across some of the largest open-source LLMs in use today, including Llama and Deepseek [3]. The method worked across multiple languages, including English, Chinese, and Hindi.

The breakthrough relies on a Recursive Feature Machine (RFM), a predictive modeling algorithm the team had previously developed. Unlike traditional unsupervised learning approaches that broadly search through unlabeled data -- what Radhakrishnan describes as "going fishing with a big net" -- the RFM targets specific concepts with precision [1]. The algorithm identifies patterns within the mathematical operations that neural networks use to learn features, then mathematically increases or decreases the importance of these concepts in the model's generated responses.
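In generic activation-steering terms (a simplified description, not necessarily the paper's exact update rule), this amounts to replacing a chosen layer's hidden state h with h' = h + α·v, where v is the learned direction for the concept and α is a tunable strength: positive α amplifies the concept in the model's output, while negative α suppresses it.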
Using a single NVIDIA Ampere series (A100) graphics processing unit, the process took less than one minute and fewer than 500 training samples to identify and steer concepts [3]. This computational efficiency represents a significant advance over existing methods for manipulating internal representations.

The researchers tested their model steering technique on 512 concepts spanning five classes: fears (such as of marriage, insects, and buttons), moods, personalities (including "social influencer" and "conspiracy theorist"), expert personas, and location-based stances like "fan of Boston" [1][2]. In one striking demonstration, they enhanced the "conspiracy theorist" representation within a vision language model and prompted it to explain the famous "Blue Marble" image of Earth from Apollo 17. The model responded with conspiratorial claims that the satellite image was part of a NASA conspiracy to cover up that Earth is flat [3].

"What this really says about LLMs is that they have these concepts in them, but they're not all actively exposed," Radhakrishnan explains. "With our method, there's ways to extract these different concepts and activate them in ways that prompting cannot give you answers to" [2].
The technique offers dual implications for AI safety and performance. On the positive side, researchers demonstrated that steering improved LLM performance on precise tasks such as translating Python code to C++. The method also proved effective in identifying hallucinations -- responses containing false or misleading information that models construct erroneously as fact [3].
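The same kind of learned concept direction can, in principle, double as a monitor: projecting a response's hidden state onto it gives a score for how strongly the concept is present, which is one way to read the paper's monitoring results. The Python sketch below reuses the variables from the earlier snippet (tok, model, LAYER, concept_dir); the projection, the example text, and the threshold are illustrative assumptions rather than the authors' procedure.

# Hypothetical monitoring sketch: score texts against a learned concept direction.
import torch

@torch.no_grad()
def concept_score(text):
    """Project the final-token hidden state at the probed layer onto concept_dir."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    h = out.hidden_states[LAYER + 1][0, -1].float()
    return torch.dot(h / h.norm(), concept_dir).item()

def flag_response(text, threshold=0.15):
    """Flag responses whose concept score exceeds a calibration threshold (assumed value)."""
    return concept_score(text) > threshold

# Example: score a candidate model response against the learned direction.
print(concept_score("The Blue Marble photo was taken by the crew of Apollo 17 in 1972."))

In practice the threshold would be calibrated on held-out labeled responses rather than chosen by hand.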
However, the research also exposes significant LLM vulnerabilities. By decreasing the importance of the concept of refusal, researchers successfully performed jailbreaking, causing models to operate outside their guardrails. In these tests, an LLM provided instructions on how to use cocaine and offered Social Security numbers, though their authenticity remains unclear. The team also boosted political bias and amplified conspiracy theory mindsets, with one model claiming the COVID vaccine was poisonous [3].
This work addresses a fundamental challenge in AI development: until recently, processes inside LLMs have been locked inside a black box, making it difficult to understand how models arrive at their answers [3]. As scientists race to understand how models represent abstract concepts like "hallucination" and "deception," this targeted approach offers a more efficient alternative to computationally expensive broad-spectrum methods [2].
The research team acknowledges the risks their method presents while emphasizing its potential to illuminate hidden concepts that could be tuned to improve model safety or enhance performance. The researchers observed that newer and larger LLMs were more steerable, and believe the technique could work with any open-source models, potentially even smaller models that run on laptops [3]. While they couldn't test closed commercial models like Claude, the computational efficiency suggests the method could be easily integrated into standard LLM training.

"These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements," the research team writes [3]. Next steps include adapting the steering method to specific inputs and applications, pointing toward a future where AI models become more transparent, controllable, and aligned with human intentions.