AI Giants Heavily Rely on Premium Publisher Content for LLM Training, Raising Copyright Concerns


New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.


AI Companies Prioritize Premium Content for LLM Training

A new study by Ziff Davis has revealed that major AI companies, including OpenAI, Google, Meta, and Anthropic, heavily rely on content from premium publishers to train their large language models (LLMs). This practice has raised concerns about copyright infringement and fair compensation for content creators [1].

Key Findings of the Research

The research, conducted by Ziff Davis' George Wukoson and Joey Fortuna, found that AI companies intentionally filter out low-quality content in favor of high-quality, human-made content for their training data. To make these distinctions, they use websites' domain authority, essentially a measure of how well a site ranks in Google search [1].

Analysis of older model disclosures showed that URLs from top-end publishers made up 12.04% of the training data in the OpenWebText2 dataset, which was used to train GPT-3 [1].

Implications for Publishers and AI Companies

This revelation has significant implications for both publishers and AI companies. Publishers argue that AI companies are using their copyrighted work without permission or compensation. Several media companies, including The New York Times, have sued AI companies for copyright infringement [1].

AI companies, on the other hand, have reached enormous valuations amid the AI boom. Google is currently valued at about $2.2 trillion and Meta at about $1.5 trillion, partly on the strength of their generative AI work. The startups OpenAI and Anthropic are valued at $157 billion and $40 billion, respectively [1].

Legal and Ethical Considerations

The use of copyrighted content for AI training has led to legal challenges. While a federal judge recently dismissed a lawsuit against OpenAI from Raw Story and AlterNet, a related case filed by The New York Times is ongoing [2].

Some AI companies have started to address these concerns by signing licensing deals with publishers. OpenAI, for instance, has inked deals with the Financial Times, Dotdash Meredith, and Vox, among others [1].

Transparency and Future Implications

The lack of transparency from AI companies about their training data sources has been a point of contention. This opacity not only affects publishers but also raises concerns for consumers, who don't have visibility into the reliability and potential biases of the information powering AI chatbots [1].

As the AI industry continues to evolve, the study's findings could give media companies more leverage when seeking copyright protection or compensation for content used in AI training. They also underscore the importance of high-quality journalism in the AI era and the threat to the continued flow of reliable information if publishers are not fairly compensated [2].
