AI Giants Heavily Rely on Premium Publisher Content for LLM Training, Raising Copyright Concerns

2 Sources

New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.

News article

AI Companies Prioritize Premium Content for LLM Training

A new study by Ziff Davis has revealed that major AI companies, including OpenAI, Google, Meta, and Anthropic, heavily rely on content from premium publishers to train their large language models (LLMs). This practice has raised concerns about copyright infringement and fair compensation for content creators 1.

Key Findings of the Research

The research, conducted by Ziff Davis' George Wukoson and Joey Fortuna, found that AI companies intentionally filter out low-quality content in favor of high-quality, human-made content for their training data. They use websites' domain authority, essentially their ranking in Google search, to make these distinctions 1.

Analysis of older model disclosures showed that URLs from top-end publishers made up 12.04% of training data in the OpenWebText2 dataset, which was used to train GPT-3 1.

Implications for Publishers and AI Companies

This revelation has significant implications for both publishers and AI companies. Publishers argue that AI companies are using their copyrighted work without permission or compensation. Several media companies, including The New York Times, have sued AI companies for copyright infringement 1.

On the other hand, AI companies have seen tremendous valuations amid the AI revolution. Google is currently valued at about $2.2 trillion, and Meta at about $1.5 trillion, partly due to their work with generative AI. Startups OpenAI and Anthropic are valued at $157 billion and $40 billion, respectively 1.

Legal and Ethical Considerations

The use of copyrighted content for AI training has led to legal challenges. While a federal judge recently dismissed a lawsuit against OpenAI from Raw Story and AlterNet, a related case filed by The New York Times is ongoing 2.

Some AI companies have started to address these concerns by signing licensing deals with publishers. OpenAI, for instance, has inked deals with the Financial Times, DotDash Meredith, and Vox, among others 1.

Transparency and Future Implications

The lack of transparency from AI companies about their training data sources has been a point of contention. This opacity not only affects publishers but also raises concerns for consumers who don't have visibility into the reliability and potential biases of the information powering AI chatbots 1.

As the AI industry continues to evolve, the findings of this study could provide media companies with more leverage when seeking copyright protection or compensation for their content used in AI training. It also highlights the importance of high-quality journalism in the AI era and the potential threats to the continuous flow of reliable information if publishers are not fairly compensated 2.

Explore today's top stories

NASA and IBM Unveil Surya: An AI Model for Predicting Solar Weather

NASA and IBM have developed Surya, an open-source AI model that can predict solar flares and space weather, potentially improving the protection of Earth's critical infrastructure from solar storms.

New Scientist logoengadget logoGizmodo logo

5 Sources

Technology

1 hr ago

NASA and IBM Unveil Surya: An AI Model for Predicting Solar

Meta Launches AI-Powered Voice Translation for Facebook and Instagram Creators

Meta introduces an AI-driven voice translation feature for Facebook and Instagram creators, enabling automatic dubbing of content from English to Spanish and vice versa, with plans for future language expansions.

TechCrunch logoCNET logoThe Verge logo

8 Sources

Technology

17 hrs ago

Meta Launches AI-Powered Voice Translation for Facebook and

OpenAI's GPT-6: Revolutionizing AI with Memory and Personalization

OpenAI CEO Sam Altman reveals plans for GPT-6, focusing on memory capabilities to create more personalized and adaptive AI interactions. The upcoming model aims to remember user preferences and conversations, potentially transforming the relationship between humans and AI.

CNBC logoTom's Guide logo

2 Sources

Technology

17 hrs ago

OpenAI's GPT-6: Revolutionizing AI with Memory and

DeepSeek and Baidu: China's Open-Source AI Revolution Challenges Western Dominance

Chinese AI companies DeepSeek and Baidu are making waves in the global AI landscape with their open-source models, challenging the dominance of Western tech giants and potentially reshaping the AI industry.

TechRadar logoVentureBeat logo

2 Sources

Technology

1 hr ago

DeepSeek and Baidu: China's Open-Source AI Revolution

The Rise of 'AI Psychosis': Mental Health Concerns Grow as AI Chatbots Proliferate

A comprehensive look at the emerging phenomenon of 'AI psychosis', its impact on mental health, and the growing concerns among experts and tech leaders about the psychological risks associated with AI chatbots.

Gizmodo logoFuturism logoThe Telegraph logo

3 Sources

Technology

1 hr ago

The Rise of 'AI Psychosis': Mental Health Concerns Grow as
TheOutpost.ai

Your Daily Dose of Curated AI News

Don’t drown in AI news. We cut through the noise - filtering, ranking and summarizing the most important AI news, breakthroughs and research daily. Spend less time searching for the latest in AI and get straight to action.

© 2025 Triveous Technologies Private Limited
Instagram logo
LinkedIn logo