MIT Researchers Expose Hidden Biases and Personalities in Large Language Models

Reviewed by Nidhi Govil

Researchers from MIT and UC San Diego have developed a method to identify and manipulate over 500 hidden concepts within large language models, including biases, personalities, and moods. The technique uses a Recursive Feature Machine algorithm to steer AI responses, improving performance while also exposing potential vulnerabilities like jailbreaking and hallucinations.

New Method Reveals Hidden Concepts in Large Language Models

Large language models (LLMs) like ChatGPT, Claude, and Gemini have evolved beyond simple text generators, accumulating abstract concepts such as personalities, biases, and moods within their neural networks. Yet understanding how these AI models represent such concepts has remained a mystery. Now, researchers from MIT and the University of California San Diego have developed a targeted method to expose hidden biases and manipulate internal representations within these models, publishing their findings in the journal Science [1][2].

Source: Neuroscience News

The team, led by Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT, and Mikhail Belkin from UC San Diego, successfully identified and steered more than 500 general concepts across five of the largest open-source LLMs in use today, including Llama and DeepSeek [3]. The method worked across multiple languages, including English, Chinese, and Hindi.

AI Steering Through Recursive Feature Machine Algorithm

The breakthrough relies on a Recursive Feature Machine (RFM), a predictive modeling algorithm the team had previously developed. Unlike traditional unsupervised learning approaches that broadly search through unlabeled data—what Radhakrishnan describes as "going fishing with a big net"—the RFM targets specific concepts with precision [1]. The algorithm identifies patterns within the mathematical operations that neural networks use to learn features, then mathematically increases or decreases the importance of these concepts to control AI-generated responses.
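To make that recipe more concrete, here is a minimal NumPy sketch of a generic RFM-style loop: kernel ridge regression with a Laplace kernel alternated with an average-gradient-outer-product (AGOP) update of a feature matrix that reweights input directions. The function names, hyperparameters, and toy data are illustrative assumptions, not the exact formulation from the Science paper.

```python
import numpy as np

def laplace_kernel(X, Z, M, bandwidth=10.0):
    """Laplace kernel exp(-||x - z||_M / bandwidth) with feature matrix M."""
    XM = X @ M
    sq = (np.sum(XM * X, axis=1)[:, None]
          + np.sum((Z @ M) * Z, axis=1)[None, :]
          - 2.0 * XM @ Z.T)
    dists = np.sqrt(np.maximum(sq, 0.0))
    return np.exp(-dists / bandwidth)

def rfm_fit(X, y, iters=5, reg=1e-3, bandwidth=10.0):
    """Alternate kernel ridge regression with an AGOP feature-matrix update."""
    n, d = X.shape
    M = np.eye(d)                                        # start with an isotropic metric
    for _ in range(iters):
        K = laplace_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # ridge-regularized fit
        G = np.zeros((d, d))
        for x in X:                                      # AGOP: average gradient outer product
            diff = x - X                                 # (n, d) differences to all training points
            dist = np.sqrt(np.maximum(np.sum((diff @ M) * diff, axis=1), 1e-12))
            k = np.exp(-dist / bandwidth)
            grad = -(1.0 / bandwidth) * ((alpha * k / dist) @ (diff @ M))
            G += np.outer(grad, grad)
        M = G / n
    return M, alpha

# Toy usage: labels depend on one hidden direction; the learned M should emphasize it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w = np.zeros(10); w[3] = 1.0                             # the "concept" lives along axis 3
y = np.tanh(X @ w)
M, _ = rfm_fit(X, y)
print(np.round(np.diag(M) / np.diag(M).max(), 2))        # axis 3 should dominate
```

In the toy run the labels depend on a single input direction, and the learned feature matrix concentrates its weight there; applied to a model's internal activations, the same idea is what allows specific concept directions to be singled out and then amplified or suppressed.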

Source: MIT

Using a single NVIDIA Ampere series (A100) graphics processing unit, the process took less than one minute and fewer than 500 training samples to identify and steer concepts [3]. This computational efficiency represents a significant advance over existing methods for manipulating internal representations.
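For a rough sense of what inference-time steering can look like, the sketch below adds a learned concept direction to one layer's hidden states through a PyTorch forward hook. The model name, layer index, steering scale, and the concept_direction.pt file are placeholder assumptions for illustration; the paper's RFM-based procedure for extracting and applying concept directions may differ in its details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"          # any open-weights causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical unit vector for a concept (e.g. "conspiracy theorist"),
# assumed to have been learned from a few hundred labeled activations.
concept_direction = torch.load("concept_direction.pt")
layer_idx, scale = 20, 8.0                               # illustrative choices

def steer_hook(module, inputs, output):
    # Nudge every token's hidden state at this layer along the concept direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * concept_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
prompt = "Explain the Blue Marble photograph of Earth."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                                          # restore normal behavior
```

Removing the hook restores the model's default behavior, and flipping the sign of the scale would suppress the concept rather than amplify it.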

From Conspiracy Theorists to Social Influencers

The researchers tested their model steering technique on 512 concepts spanning five classes: fears (such as fear of marriage, insects, and buttons), moods, personalities (including "social influencer" and "conspiracy theorist"), expert personas, and location-based stances like "fan of Boston" [1][2]. In one striking demonstration, they enhanced the "conspiracy theorist" representation within a vision language model and prompted it to explain the famous "Blue Marble" image of Earth from Apollo 17. The model responded with conspiratorial claims that the image was part of a NASA conspiracy to cover up that Earth is flat [3].

"What this really says about LLMs is that they have these concepts in them, but they're not all actively exposed," Radhakrishnan explains. "With our method, there's ways to extract these different concepts and activate them in ways that prompting cannot give you answers to"

2

.

Improving AI Safety and Performance While Exposing Vulnerabilities

The technique offers dual implications for AI safety and performance. On the positive side, the researchers demonstrated that steering improved LLM performance on precise tasks such as translating Python code to C++. The method also proved effective in identifying hallucinations—responses containing false or misleading information that models erroneously present as fact [3].
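As a generic illustration of reading such a signal out of a model's internals, the sketch below trains a simple logistic-regression probe on hidden states labeled as factual or hallucinated. The data here is synthetic and the probe is not the paper's detector; real usage would substitute activations collected from an actual LLM.

```python
# Toy probe: flag "hallucinated" responses from hidden-state features.
# Synthetic data stands in for real LLM activations (assumption for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_dim, n_samples = 256, 400
direction = rng.normal(size=hidden_dim)            # pretend "hallucination" direction
labels = rng.integers(0, 2, size=n_samples)        # 1 = hallucinated, 0 = factual
states = rng.normal(size=(n_samples, hidden_dim)) + 0.8 * np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.2f}")
```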

However, the research also exposes significant LLM vulnerabilities. By decreasing the importance of the concept of refusal, researchers successfully performed jailbreaking, causing models to operate outside their guardrails. In these tests, an LLM provided instructions on how to use cocaine and offered Social Security numbers, though their authenticity remains unclear. The team also boosted political bias and amplified conspiracy theory mindsets, with one model claiming the COVID vaccine was poisonous [3].

Breaking Open the Black Box

This work addresses a fundamental challenge in AI development: until recently, processes inside LLMs have been locked inside a black box, making it difficult to understand how models arrive at their answers [3]. As scientists race to understand how models represent abstract concepts like "hallucination" and "deception," this targeted approach offers a more efficient alternative to computationally expensive broad-spectrum methods [2].

The research team acknowledges the risks their method presents while emphasizing its potential to illuminate hidden concepts that could be tuned to improve model safety or enhance performance. The researchers observed that newer and larger LLMs were more steerable, and believe the technique could work with any open-source model, potentially even smaller models that run on laptops [3]. While they couldn't test closed commercial models like Claude, the method's computational efficiency suggests it could be easily integrated into standard LLM training.

"These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements," the research team writes

3

. Next steps include adapting the steering method to specific inputs and applications, pointing toward a future where AI models become more transparent, controllable, and aligned with human intentions.
