3 Sources
[1]
State media control shapes LLM behaviour by influencing training data
Generative artificial intelligence has been shown to have considerable persuasive potential, raising concerns about who builds the generative models. We decided to investigate how powerful institutions shape the information environment that produces the data that others use to build their large language models (LLMs). States constantly aim to shape discourse through media control and information operations. These efforts can, sometimes unintentionally, influence LLMs that the states do not directly control. We proposed that this influence is strongest when a state produces a substantial amount of content, uses repeated phrasings that are easy for models to absorb, and operates in a language that is mostly exclusive to that state.

We looked for observable evidence to support our hypotheses of institutional influence, with a key challenge being that AI companies rarely give researchers access to the inside of their models. We expected that state influence would be most evident in the language of the state: a question about a government should produce a more pro-government answer if posed in the language specific to the government's territory than in a different language. If this pattern of more pro-government answers in the local language is driven by institutional influence, it should be strongest in states with tight media control. Indeed, this is what we found across 37 states in which at least 70% of the speakers of an official state language live in the state (Fig. 1).

This result is only a correlation, so we did a case study on China to trace the mechanism. We found that state-scripted news content shows up in a common training data set 41 times more often than content from Chinese-language Wikipedia does; that commercial models reproduce distinctive phrases from state-scripted news; and that further pretraining of an 'open-weight' LLM (one with publicly released parameters) on state-scripted news causes more pro-government responses than training on other Chinese-language text does, especially when prompted in Chinese. Linking these open-weight results to closed commercial models gives a result in line with the cross-state analysis: ChatGPT and Claude gave more-favourable answers about institutions and leaders in China to queries written (by us or by ordinary users) in Chinese than in English. Together, our results support the same mechanism of institutional influence.

Our results indicate that state media control shapes the behaviour of LLMs, even beyond state borders, by influencing LLM training data. Institutional influence on LLMs is particularly concerning because it 'launders' the source of manipulated content into seemingly objective text. There is no reason to think that institutional influence was an initial motivation for state media control or information operations, but our results highlight the potential for future manipulation of LLMs by influencing the information environment. Authoritarian governments have advantages over democracies in this respect, because state outlets flood the information environment with their material, whereas state-owned organizations in democratic states typically have to compete with opposition or commercial media.

Furthermore, our study highlights the need for transparency about training data from technology companies. Researchers and users of LLMs could better understand the scale of institutional influence if we could directly examine the training data and model parameters of commercial LLMs.
Moreover, the pervasive effects of state media control make it hard to assess model training counterfactuals. Even if all state-scripted media could be removed from LLM training data, other documents in the training set would be influenced by it anyway.

There is much more work to do. Previous studies have shown that LLMs are persuasive, but we do not know how much of that persuasive potential is carried through the institutional influence we document here. These results are only for text, but the same influences are probably present for multimodal (image and video) models. The bottom line is that training data are produced in a context shaped by socio-political institutions. An understanding of these institutions would give us a better understanding of LLM behaviour.

-- Brandon M. Stewart is at Princeton University, Princeton, New Jersey, USA, and Hannah Waight is at the University of Oregon, Eugene, Oregon, USA.

This project came about through the merging of two research streams. One half of the group had published research on Chinese state media coordination that provided important context and a source of data. Other members of the group came into the project through previous related research on AI. This combination was particularly generative for the design of this study.

When we initially sent the project to Nature, we had only the case study of China. Our editor, Mary Elizabeth Sutherland, suggested that, if we were right about the mechanism of institutional influence, we should see the same result in different countries. When we ran the cross-state analyses, we were thrilled to see that the results we had painstakingly collected in China held across the world. This editorial collaboration gave us even greater confidence in our theory, because it had passed the hardest test: one proposed by a critical reader. -- B.M.S. and H.W.
[2]
State media control influences large language models - Nature
We have made our case primarily in the context of China, but the empirical patterns that we have identified speak more broadly to powerful institutions and the role of training data in AI. Just as companies and governments have incentives to manipulate search results and social media algorithms, so too may they try to use their institutional power to control the output of generative AI.

We used the open-source training dataset CulturaX, a cleaned and de-duplicated 6.3-trillion-token multilingual dataset derived from the Common Crawl. The Common Crawl is a massive dataset of daily web crawls; as of February 2024, it contained text from more than 250 billion web pages collected over 17 years. The creators of CulturaX combined and cleaned the latest versions of two Common Crawl derivatives: multilingual C4 (3.1.0) and OSCAR (OSCAR 2019, OSCAR 21.09, OSCAR 22.01 and OSCAR 23.01). C4, OSCAR and Common Crawl are common training data sources for language and other machine-learning models.

In Fig. 2a, we compared the nearly 200 million Chinese-language CulturaX documents to our two sources of Chinese state-coordinated media. We measured the degree of text-sequence overlap between our state-coordinated documents and the CulturaX documents using five-word gram cosine similarity, a common measure of text reuse. Intuitively, a high cosine similarity indicates that two documents share many sequences of five words. We ran the main analysis of this study, matching CulturaX documents to our state-coordinated documents, in March and April 2024.

CulturaX documents are web pages in the Common Crawl and thus include extraneous content (for example, advertisements) beyond the main content of the page. As such, we did not require CulturaX documents to exactly copy the state-coordinated documents; instead, we considered two documents to be matched (that is, likely to be copying from each other or from a third, shared source) if they had at least 0.2 five-word gram cosine similarity. In Supplementary Information Section A, we validate this cut-off and provide additional analyses explaining these patterns. Our findings suggest that the patterns in Fig. 2a are largely driven by the spread of state-scripted and standardized language across the Chinese internet.
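To make the matching rule concrete, here is a minimal sketch of five-word gram cosine similarity with the 0.2 cut-off. This is a hypothetical illustration, not the authors' code; it assumes documents arrive as token lists (for example, after Chinese word segmentation) and uses raw n-gram counts as the vector weights.

```python
import math
from collections import Counter

def five_gram_counts(tokens):
    """Count every contiguous five-token sequence in a document."""
    return Counter(tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4))

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two five-gram count vectors."""
    dot = sum(n * counts_b[gram] for gram, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

MATCH_CUTOFF = 0.2  # the paper's match threshold, validated in Supplementary Information Section A

def is_match(tokens_a, tokens_b):
    """Two documents 'match' if their five-gram profiles overlap enough."""
    return cosine_similarity(five_gram_counts(tokens_a),
                             five_gram_counts(tokens_b)) >= MATCH_CUTOFF
```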
Researchers on our team developed the keywords used in Fig. 2a relating to Chinese leaders and political institutions. These terms included Central Committee Plenum (共产党 and 中央委员会 and 全体会议), Party Congress (中国 and 全国代表大会 and (十八 or 十九 or 二十)), Chinese Communist Party (中国 and 共产党), National People's Congress (人民代表大会 or 人大), foreign affairs (外交部 and 发言人), economy (经济 and (社会 or 发展)), Xi Jinping (习近平), Deng Xiaoping (邓小平) and Mao Zedong (毛泽东).

In an additional robustness check, we tested whether the matching patterns in Fig. 2a hold when we examined another institution of Chinese state media control: news articles and television transcripts from state-run Chinese media. We collected 7,227,128 web news articles over 11 years (2012-2022) from Xinhua News Agency, China's largest state-run news agency. We paired this with 89,793 television transcripts covering nearly 10 years (2016 to June 2025) of Xinwen Lianbo, a nightly news broadcast by CCTV, China's largest state-run television broadcaster. In Extended Data Fig. 1, we see similar patterns with this more numerically common but less directly state-influenced type of content. CulturaX documents with more sensitive terms exhibit higher rates of state-controlled media matching across all source types.

The match rate to state media documents is also higher than the match rate to scripted news and Xuexi Qiangguo for CulturaX documents containing non-sensitive keywords (soccer, weather). This pattern is consistent with the observation that Xinhua articles and Xinwen Lianbo transcripts contain more than just state-coordinated and state-controlled content.

We conducted a series of domain benchmarks to further understand the makeup of Chinese-language CulturaX. We searched for a series of domain names in the URLs of simplified Chinese-language CulturaX documents: Wikipedia; Baidu ('China's Google', with multi-functional sub-domains including news, the wiki pages of Baidu Baike, chatrooms and the Quora-like Baidu Zhidao); Xinhua News Agency web news; and state-run People's Daily web news. We also estimated the percentage of Chinese-language CulturaX documents that came from a government 'gov.cn' or 'chinacourt.org' domain.

Our results (see Extended Data Fig. 2) show that content from Chinese government-controlled and government-run web pages makes up a much larger share of Chinese-language CulturaX documents than content from Chinese-language Wikipedia pages. We found that 1.65% of simplified Chinese-language CulturaX documents are from either a gov.cn or chinacourt.org domain, but only 0.0402% of documents are from a Wikipedia page (about 41 times fewer documents). The 1.65% is close to the percentage of Chinese-language CulturaX documents matched via text reuse to a scripted news or Xuexi Qiangguo document in our main results (1.64%). This estimate is also close to the fraction of documents attributed to Wikipedia in 'The Pile', a commonly used machine-learning dataset. In the Supplementary Information, we conducted a further benchmark test with a text-based measure, matching CulturaX documents to Chinese-language Wikipedia. Despite using similarly sized corpora, we matched 12 times as many Chinese state-coordinated documents to CulturaX as we did Chinese-language Wikipedia pages.

We used our memorization analysis to provide further evidence that Chinese state-coordinated media is in the training data of commercial LLMs. Language models memorize only a small portion of their training data, but memorization increases with phrase repetition. To test for the presence of state-coordinated media in LLM training data, we selected state-coordinated sub-texts that would, if actually in the training data, be the most likely to be memorized and extractable. We identified common 20-word sequences characteristic of state-coordinated documents, a sub-text length close to the median sentence length in a sample of our scripted news documents (22 words).

We baselined the memorization rate of these sub-texts against the memorization rate for naturally occurring common word sequences in internet-based Chinese, approximated with common 20-word sequences in non-state-coordinated CulturaX documents. These non-state-coordinated documents were a random sample of CulturaX documents with less than 0.1 five-word gram cosine similarity to any scripted news or Xuexi Qiangguo document. This lower threshold (0.1 versus the 0.2 cut-off that we used as the match threshold for study 1) increases our confidence that these documents did not include sub-texts from our state-coordinated documents. We used lasso regression to identify the 1,000 20-word grams most associated with the state-coordinated documents and the 1,000 20-word grams most associated with the non-state-coordinated CulturaX documents.
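The excerpt does not give implementation details for this selection step. One way to approximate it is an L1-penalized logistic regression over binary 20-gram indicators, which plays the role of the lasso for a binary outcome; the vectorizer settings and penalty strength below are assumptions, not the authors' specification.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def top_20grams(docs, is_state_coordinated, k=1000):
    """Rank 20-grams by how strongly they separate the two document classes.

    docs: whitespace-joined token strings; is_state_coordinated: 0/1 labels.
    """
    vectorizer = CountVectorizer(ngram_range=(20, 20), binary=True,
                                 token_pattern=r"\S+")
    X = vectorizer.fit_transform(docs)
    # L1 penalty stands in for the lasso; C=0.5 is an arbitrary choice here
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    model.fit(X, is_state_coordinated)
    grams = vectorizer.get_feature_names_out()
    order = np.argsort(model.coef_.ravel())
    # largest coefficients: most state-associated; smallest: most background-associated
    return grams[order[-k:]][::-1], grams[order[:k]]
```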
We measured the extent to which commercial models memorized these 20-word sequences by prompting the models with the first half of each sequence and then estimating the overlap between the model completions and the actual ending sequences. We prompted with the models' 'temperature' set to zero and considered only the completions in which the model did not refuse to answer. We did not require 'regurgitated' model completions to be exact copies of a state-coordinated or CulturaX phrase, as such a strict threshold would miss cases with small differences such as punctuation marks. Instead, we estimated whether the completions were near copies by measuring the edit distance between model completions and the actual ending word sequences. We labelled a phrase as memorized if the model's completion had a normalized edit distance of less than 0.4 from the actual ending phrase.

In Supplementary Information Section B, we show that our finding that commercial models regurgitate Chinese state-coordinated documents is robust to alternative approaches to phrase selection, including 30-word gram sequences and randomly selected short paragraphs. Furthermore, we validated our memorization threshold with hand labelling, provided more details on our measurement strategy and estimated Shannon's entropy for state-coordinated and non-state-coordinated 20-word phrases. In Extended Data Table 1, we include an example of a memorized state-coordinated phrase; more examples are in the Supplementary Information. We ran the main results of study 2 in January 2025.
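The memorization rule above reduces to a thresholded normalized edit distance. A minimal sketch follows; normalizing by the longer sequence length is our assumption, as the excerpt does not state the normalization.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (tok_a != tok_b)))   # substitution
        prev = curr
    return prev[-1]

def is_memorized(completion, true_ending, cutoff=0.4):
    """A completion counts as memorized if its normalized edit distance
    to the actual ending phrase falls below the paper's 0.4 cut-off."""
    denom = max(len(completion), len(true_ending), 1)
    return edit_distance(completion, true_ending) / denom < cutoff
```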
We used Llama 2 13B for our pretraining experiment (https://huggingface.co/meta-llama/Llama-2-13b-hf) to strike a balance between feasibility (it fits on a single A100 80-GB GPU) and language competency (it is unlikely to generate random words). Another advantage of Llama 2 is strong evidence that there is little to no Chinese state-coordinated media in its pretraining data: Llama 2 had very few Chinese-language pretraining tokens.

We sequentially added Chinese-language pretraining documents to Llama 2 in three conditions: scripted news articles; non-scripted news articles similar to the scripted articles in topic, year and article length; and non-state-coordinated CulturaX documents similar to the scripted articles in article length. This allowed us to isolate the effect of additional pretraining on state-scripted news compared with non-scripted (but still state-controlled) news media and general Chinese-language text. We saved a model checkpoint every 100 training steps (for a total of 1,000 training steps), using a batch size of 64. To give the models the ability to chat and answer questions, we fine-tuned all checkpoints on the same set of English instructions. We then prompted the instruction fine-tuned models at each checkpoint with the same political prompts that we used in the study 4 LLM-as-judge audit. To reduce the resources required for the experiment, we used LoRA for both pretraining and fine-tuning, updating all linear layers with a rank of 32. We used GPT-4o to rate the favourability of responses from the models with additional pretraining against those of the original Llama 2 model with instruction fine-tuning only.

One important complication for our study is that training examples seen later probably have more influence on model weights than earlier examples. This phenomenon, often called 'catastrophic forgetting', occurs because of the sequential nature of training: weights in the network that are important for early examples are overwritten as the model updates on examples seen later in the process. LLMs tend to memorize phrases from pretraining data seen later in the training process at higher rates. In our experiment, models saw the state-coordinated content more recently than the rest of the data. This further underscores that our experiment should be understood as demonstrating a plausible mechanism by which training on state-coordinated media affects LLM outputs through the model parameters; we do not know how closely it mimics real commercial model training.

We conducted a range of additional tests and analyses, including replicating our experiment on Llama 3.1, translating the instruction fine-tuning dataset into Chinese, using a rank of 8 for updating LoRA weights and using an absolute rather than relative measure of model favourability in the evaluation stage. These additional results are included in Supplementary Information Section C, along with further details of the experimental setup. We executed the pretraining phase of this study between March and September 2024 and evaluated the completions of these models in January 2025. We include example model completions from our pretraining experiment in Extended Data Table 2 and an additional example in Supplementary Information Section C.

In studies 4 and 5, we looked for the observable implications of the state-coordinated training data effects that we observed in study 3: for production models trained on Chinese state-coordinated media, we should see more favourable responses about China when prompting in Chinese than when prompting in English. In study 4, we ran a human evaluator audit of GPT-3.5 and an LLM evaluator audit of a larger range of GPT and Claude models with prompts that we created. In study 5, we replicated the LLM-as-judge design on real-user prompts.

For all three audits, we blinded the evaluator to the provenance of each completion (whether it came from an original Chinese or English prompt) by translating the completions into the other language. Therefore, for each prompt, we generated two comparison pairs: one in English (the English completion and the Chinese completion translated into English) and one in Chinese (the Chinese completion and the English completion translated into Chinese). We visualize this design in Extended Data Fig. 3. It is analogous to past search engine audit studies that prompt the system with queries that vary along the dimension of interest.
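As a sketch, the pair construction might look like the following, where generate and translate are hypothetical stand-ins for the production model and the translation step.

```python
def build_comparison_pairs(prompt_en, prompt_zh, generate, translate):
    """For one prompt, return two single-language comparison pairs so the
    judge never sees which completion came from which prompt language."""
    completion_en = generate(prompt_en)   # completion from the English prompt
    completion_zh = generate(prompt_zh)   # completion from the Chinese prompt
    return {
        "en": (completion_en, translate(completion_zh, target="en")),
        "zh": (completion_zh, translate(completion_en, target="zh")),
    }
```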
We ran the production model completions for our human audit in November 2023, and our research assistants evaluated these completions in March and April 2024. The study 4 LLM-as-judge GPT-3.5, Claude Sonnet and Claude Opus audit completions and evaluations were run in March 2024; the GPT-4o audit was run in May 2024. We ran and evaluated the study 5 WildChat completions in September 2024.

In our pre-registered human audit, our nine research assistants chose which completion in each pair reflected more positively on the institution, leader or country that was the subject of the prompt. We randomized whether each pair of completions was displayed to the research assistants in Chinese or English. We calculated for each pair the percent of research assistants who labelled the original Chinese completion as more favourable. Across all prompts about China, the average percent of research assistants who chose the original Chinese completion was 75.3%. For prompts not about China, this estimate was 52.8%, no different from random guessing. In Supplementary Information Section D, we show the distribution over prompts of the percent of research assistants who chose the original Chinese completion.

Our pre-registration for the human audit is available at OSF (https://doi.org/10.17605/OSF.IO/M4URP). For the human-as-judge audit in study 4, the NYU Institutional Review Board issued a Not Human Subjects determination and the Princeton Institutional Review Board issued approval as Study #16569. We obtained informed consent.

In our LLM-as-judge audit, we evaluated Claude Sonnet and Claude Opus completions with Claude Opus, and GPT-4o and GPT-3.5 completions with GPT-4o. Across all prompts, we consistently evaluated the completions with "Which response reflects more positively on COUNTRY NAME?" We also used a similar but distinct set of prompts from those in the human audit. Finally, each pair of English and Chinese completions was evaluated twice: once when the pair was displayed to the LLM-as-judge in English and once in Chinese. Our estimates in Fig. 4 average over differences in display language. Despite these distinctions between the two audits, we replicated our main human audit results across all models. We provide both the human audit and the LLM-as-judge audit prompts in Supplementary Information Section D.
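One judgement call might be sketched as below, using the evaluation question quoted above. Here, ask_judge is a hypothetical wrapper around the judge model, and the A/B position randomization is our addition rather than a stated detail of the study.

```python
import random

JUDGE_TEMPLATE = ("Which response reflects more positively on {country}?\n\n"
                  "Response A:\n{a}\n\nResponse B:\n{b}")

def original_completion_wins(ask_judge, country, original, translated, rng=random):
    """Return True if the judge prefers the completion generated in the
    display language over the translated one."""
    original_first = rng.random() < 0.5  # randomize which completion is shown as A
    a, b = (original, translated) if original_first else (translated, original)
    verdict = ask_judge(JUDGE_TEMPLATE.format(country=country, a=a, b=b))
    prefers_a = verdict.strip().upper().startswith("A")
    return prefers_a == original_first
```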
Before conducting our audit with real-user prompts, we needed to understand how real users use LLMs to ask political questions. We collected from the WildChat dataset Chinese-language political queries of ChatGPT written by real users. We identified these political queries through a combination of keywords and hand coding (see Supplementary Information Section E for more details). We found that the most frequent way users engaged ChatGPT with political questions was to ask it to generate text for school essays and work tasks related to Chinese politics. These 'content generation' prompts made up 50% of our sample of political queries. The second-most frequent category was opinion or information seeking (30% of sample conversations). These prompts were closest to our political opinion questions from studies 3 and 4, although the content generation prompts also exposed users to opinions and information generated by ChatGPT. The third-most frequent category was writing development (18.4% of sample conversations), in which users asked ChatGPT for help with proofreading, translation or summarization. In Extended Data Table 3, we include an example political query from this analysis.

We replicated study 4 with a separate set of real-user queries from the WildChat dataset. We supplemented the WildChat data with queries from Baidu Zhidao and Zhihu, China's equivalents to Yahoo Answers and Quora, respectively. We collected these two latter sets of queries from an open-source Chinese-language training data archive. In this analysis, we limited all queries to those that referenced Xi Jinping or the CCP. In a random sample of the WildChat queries, we found high precision for our keywords (close to 90%). Owing to lower precision for these keywords in the Baidu Zhidao and Zhihu data, we had research assistants review all instances.

We used the same study design as in study 4, translating all English and Chinese-language completions into the other language, randomizing the display language and evaluating with GPT-4o which completion was more favourable to the subject (Xi Jinping, the CCP or both). We eliminated from the analysis 37 observations in which the model refused to answer. See Supplementary Information Section E for more details on both WildChat analyses and for further examples.

We used an LLM-as-judge in studies 3, 4 (excluding the human audit), 5 and 6 to label completion pairs. A problem with using LLMs as a surrogate for human labels is that even small amounts of error in LLM labels can bias the regression coefficients of downstream analyses. We tested the sensitivity of our study 4 results with the design-based supervised learning (DSL) estimator of Egami et al. The DSL estimator uses a random sample of gold-standard human labels to adjust for biases in the coefficients and confidence intervals of a downstream estimate. We had three research assistants label our gold-standard dataset, treating the majority vote as the gold-standard label. We include these debiased results in Extended Data Fig. 4: the debiased estimates and confidence intervals are largely similar to the naive estimates, suggesting that any error in the LLM annotation process created minimal bias in our downstream analyses.

For our audit of DeepSeek, we used a design similar to our study 4 LLM-as-judge design. In this case, however, we compared the Chinese-language outputs of DeepSeek-R1 and OpenAI's GPT-4o. We found that DeepSeek-R1 produced more pro-China responses than GPT-4o for 99% of our prompts (in both English and Chinese; see Extended Data Fig. 5).

In study 6, we provide evidence that media content from states beyond China with high levels of media control is affecting LLM training data and output. We looked at 37 countries where at least 70% of the global speakers of the country's official national language reside in the country; the full list of countries and languages is in Supplementary Information Section F. This restriction allowed us to isolate the effect of an individual state's system of media control with less interference from other states' manipulation of their media ecosystems (or lack thereof). We identified the percentage of the global population who speak the language in a given country using the Ethnologue data. After limiting the potential languages to the 160 identified as being represented in the Common Crawl by the Compact Language Detector 2, we further restricted our cases to the 37 countries that met our language-exclusivity criterion, whose languages are official national languages and whose languages are generated well enough by commercial LLMs to be studied. For each country, we measured the degree of media control with the World Press Freedom Index constructed by Reporters Without Borders.
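The case-selection rule could be expressed roughly as follows. The input data structures are hypothetical stand-ins for the Ethnologue speaker shares and the Compact Language Detector 2 language list; the additional screen for LLM language competency is noted but not implemented here.

```python
def select_countries(speaker_share, crawl_languages, official_national, threshold=0.70):
    """Keep (country, language) pairs in which at least 70% of the language's
    global speakers live in that country, the language is detected in the
    Common Crawl, and the language is an official national language.

    speaker_share: {(country, language): share of global speakers in country}
    crawl_languages: set of languages represented in the Common Crawl
    official_national: set of (country, language) pairs with official status
    """
    return sorted(
        (country, lang)
        for (country, lang), share in speaker_share.items()
        if share >= threshold
        and lang in crawl_languages
        and (country, lang) in official_national
    )
```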
We used the same prompt templates from study 4, adapted to the countries in the study. We prompted each model in both English and the primary language of the target country. We then used an LLM-as-judge to discern which completion was more favourable to the target country. Following the study 4 LLM-as-judge design, each pair of English and target-language completions was evaluated twice: once when the pair was displayed to the LLM-as-judge in English and once in the target language. Our estimates in Fig. 5 average over differences in display language.

We conducted the audits (703 country prompts, 3,848 institution prompts and 1,500 leader prompts across 37 countries) across four models: GPT-3.5, GPT-4o, Claude Opus and Claude Sonnet. See Supplementary Information Section F for additional details. We ran the main results (completions and evaluations) of this study in January and February 2025.

In Supplementary Information Section F, we provide several robustness checks designed to verify that the observed patterns support our argument. First, we show that the overall trend is specific to questions about the target country, and not to general favourability in the target language, by replicating our analyses on prompts related to 'placebo' countries (the USA and China; see Supplementary Fig. 21). Second, we show that our results are not specific to the English baseline but also hold with baselines in Spanish and Chinese (Extended Data Fig. 6). Last, we show that the results are robust to several different evaluation designs: varying the language of the completions displayed during the evaluation phase (Supplementary Fig. 29), the LLM used for judgement (Supplementary Fig. 26), whether we used binary outcomes or the log likelihood of predicted tokens (Supplementary Fig. 27), whether we clustered standard errors (Supplementary Fig. 28), the measure used for country media freedom (Supplementary Figs. 22-25) and the overall type of prompt (Supplementary Fig. 30).

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
[3]
Governments May Shape What AI Chatbots Say by Shaping the Web They Learn From | Newswise
Newswise -- Ask an AI model the same political question in two different languages, and you may get two very different responses. A new study in Nature suggests one reason why: governments can indirectly influence large language models (LLMs) by shaping the online media environment, and thus the text those systems learn from.

A team of researchers spanning the University of Oregon, Purdue University, the University of California San Diego, New York University and Princeton University found evidence that state media control can leave detectable traces in AI model behavior. The researchers combine evidence from evaluating LLMs in the local languages of 37 countries with a case study from China to understand how this happens. Across six studies, the team traced the pathway from online media to training data to model behavior, combining analysis of open training data, experiments with training small models, human evaluation and real-world tests of commercial chatbots.

"People often talk about AI as if it learns from the internet in some neutral way," said Hannah Waight, co-first author of the study and Assistant Professor of Sociology at the University of Oregon. "It doesn't. It learns from information environments that have already been shaped by institutions and power, and those environments can leave measurable traces in what models say." The researchers call this idea institutional influence.

Joshua Tucker, co-author and co-Director of the NYU Center for Social Media, AI, and Politics, added, "The public debate has focused on what AI can generate, but this study points upstream. Before AI systems can influence politics, politics can influence AI."

To trace this institutional influence through the training process, the authors first showed that state-coordinated media appears frequently in real training data. Comparing two sources of Chinese state-coordinated media with a major open-source multilingual training dataset derived from Common Crawl, the researchers found more than 3.1 million Chinese-language documents with substantial phrasing overlap, about 1.64% of the dataset's Chinese-language subset. That is over 40 times the rate for documents from Chinese-language Wikipedia, a common training source. Among documents mentioning Chinese political leaders or institutions, the share rose as high as 23%. Only about 12% of the matched documents came from known government or news domains, suggesting that the material had spread widely across the web before reaching AI training corpora. The researchers also found that commercial models memorized distinctive phrases associated with this material, suggesting that the phrases had been seen many times during training.

"State-coordinated content is not just about what appears in official media. It is also about recirculation: the same phrasing moving through newspapers, apps, reposts and ordinary web pages until it looks like part of the broader information environment. Once state-coordinated content is in the training data, the model can launder it into what looks and sounds like neutral, objective information," said Brandon M. Stewart, the paper's corresponding author and Associate Professor of Sociology at Princeton University.

The team then tested whether that content could actually shift a model's behavior. Large commercial models take months and millions of dollars in compute to train, so the team experimented with taking a small, open model and adding additional documents to the training process.
The results were clear: adding scripted news to the training data made the models more likely to produce more favorable answers, nearly 80% of the time compared with an unmodified model. This was true even when compared with other, non-scripted Chinese media, and especially when compared with simply adding general Chinese-language text from the internet.

"When the same political question produces systematically different answers with only small changes to the training data, that suggests those additional documents are doing real work," explained Eddie Yang, co-first author of the study and Assistant Professor of Political Science at Purdue University, who started the research as a doctoral student in political science at UC San Diego.

The team reasoned that if states have strong real-world influence over the pretraining data, it should appear most clearly in the state's primary language. For example, a question about the Chinese government should produce a more pro-government answer when posed in Chinese than the same question posed in English. They used this within-model, cross-language comparison to probe commercial models without access to their internal parameters. In responses to political questions about China, human raters judged the Chinese-prompted answer to be more favorable to China 75.3% of the time. For prompts not about China, the rate was no different from chance. The language difference gave them a rare window into a closed system. Follow-on studies using real user prompts and additional commercial models found the same general tendency: on questions about Chinese leaders and institutions, answers tended to be more favorable when the prompt was in Chinese than when it was in English.

The researchers also show that this is not just about China. In a cross-national study of 37 countries where a national language is largely concentrated within a single country, models portrayed governments and institutions from countries with stronger media control more favorably in that country's language than in English. The authors emphasize that this result is correlational, but say it is consistent with the mechanism identified in the China case study.

"This is not evidence that AI companies set out to curry favor with those governments, or that those governments control media systems with chatbots in mind," said Margaret E. Roberts, a co-author and UC San Diego Professor of Political Science who is Co-Director of the China Data Lab at the School of Global Policy and Strategy's 21st Century China Center. "States shape the information environment, the information environment shapes training data, and training data shapes model outputs. But going forward, our findings suggest that LLMs create new incentives for powerful actors to think strategically about the text they disseminate online."

The authors stress that no single test can capture how a commercial model was trained, because many of those details are not publicly known. The paper instead combines multiple approaches, including analysis of open-source data, memorization tests of commercial systems, retraining experiments, human evaluation, real-user audits and cross-national comparison, to identify one of the ways that political power can enter AI systems. At the project website, https://state-media-influence-llm.github.io/, the authors show that the results replicate with the latest released models. Beyond nation-states, the researchers emphasize that other powerful institutions may also be able to shape large volumes of online text.
"Training data is the foundation of modern AI," said Solomon Messing, a co-author and Research Associate Professor at the NYU Center for Social Media, AI, and Politics. "If we want to understand the powerful interests these models reflect, we need to know how we're sourcing the concrete. That starts with more transparency about what goes into the training data."