OpenAI Accused of Training GPT-4o on Copyrighted O'Reilly Books Without Permission

Curated by THEOUTPOST

On Wed, 2 Apr, 4:02 PM UTC

A new study by the AI Disclosures Project suggests that OpenAI may have used paywalled O'Reilly Media books to train its GPT-4o model without proper licensing, raising concerns about copyright infringement and the need for transparency in AI training data sources.

AI Watchdog Accuses OpenAI of Copyright Infringement

A new study by the AI Disclosures Project, a nonprofit co-founded by Tim O'Reilly and Ilan Strauss, has accused OpenAI of training its GPT-4o model on copyrighted O'Reilly Media books without permission 1. The research, which used a method called DE-COP, suggests that GPT-4o recognizes paywalled O'Reilly book content far more strongly than earlier OpenAI models do 1.

Study Methodology and Findings

The researchers used 13,962 paragraph excerpts from 34 O'Reilly books to probe GPT-4o, GPT-3.5 Turbo, and other OpenAI models 1. The study found that GPT-4o "recognized" far more paywalled O'Reilly book content than older models, even after accounting for potential confounding factors 1.
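The article does not spell out how DE-COP scores "recognition". As the method is generally described, the model is shown a verbatim passage alongside paraphrased versions and asked to pick the original; picking it more often than chance is read as a sign the passage may have been seen in training. The Python sketch below illustrates only that scoring idea. The `QuizResult` type, the four-option setup, and the trial data are assumptions made up for illustration, not details taken from the study.

```python
from dataclasses import dataclass


@dataclass
class QuizResult:
    """One multiple-choice trial (hypothetical structure, not from the study)."""
    verbatim_index: int  # position of the original passage among the options
    chosen_index: int    # position the model selected


def guess_rate(results):
    """Fraction of trials in which the model picked the verbatim passage."""
    if not results:
        raise ValueError("no trials to score")
    hits = sum(1 for r in results if r.chosen_index == r.verbatim_index)
    return hits / len(results)


def excess_over_chance(results, num_options=4):
    """How far the guess rate sits above random guessing (1 / num_options)."""
    return guess_rate(results) - 1.0 / num_options


# Invented trial outcomes, purely for illustration.
trials = [QuizResult(0, 0), QuizResult(2, 2), QuizResult(1, 3), QuizResult(3, 3)]
print(guess_rate(trials))
print(excess_over_chance(trials))
```

A guess rate close to the chance level would suggest no special familiarity with the text, while a rate well above it is the kind of signal the study interprets as recognition. A real analysis would also need controls for confounding factors, which this sketch omits.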

Copyright Debate and Legal Context

This accusation comes amid ongoing debates about AI companies' use of copyrighted material for training purposes. OpenAI has been advocating for looser restrictions on developing models using copyrighted data 2. The company has some content licensing deals in place but faces several lawsuits over its training data practices 1.

Broader Copyright Concerns in AI Training

A separate study by researchers from the University of Washington, the University of Copenhagen, and Stanford proposed a new method for identifying training data "memorized" by models 2. This study suggested that GPT-4 showed signs of having memorized portions of popular fiction books and New York Times articles 2.

Industry Response and Future Implications

The findings highlight the need for increased transparency regarding pre-training data sources and the development of formal licensing frameworks for AI content training 3. There are concerns that failure to adequately compensate creators could lead to a decline in internet content quality and diversity 3.

OpenAI's Position and Industry Trends

OpenAI has been seeking higher-quality training data and has hired experts in various domains to fine-tune its models' outputs 1. The company has also urged the US government to relax copyright restrictions to facilitate AI model training 3.

As the AI industry grapples with these issues, some companies are introducing measures to protect copyrighted material. For instance, Cloudflare has developed an AI-powered system designed to deter unauthorized web scraping 3.

Continue Reading
Former OpenAI Researcher Condemns Company's Data Practices, Alleging Copyright Violations

Suchir Balaji, a former OpenAI employee, speaks out against the company's data scraping practices, claiming they violate copyright law and pose a threat to the internet ecosystem.

AI Giants Heavily Rely on Premium Publisher Content for LLM Training, Raising Copyright Concerns

New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.

OpenAI and Google Push for Relaxed Copyright Laws in AI Development

OpenAI and Google advocate for looser copyright restrictions on AI training data in their proposals for the US government's AI Action Plan, citing the need to compete with China and promote innovation.

OpenAI Accidentally Deletes Potential Evidence in Copyright Lawsuit with The New York Times

OpenAI faces challenges in a copyright lawsuit as it accidentally erases crucial data during the discovery process, leading to delays and complications in the legal battle with The New York Times and Daily News.

OpenAI Denies Copyright Infringement Allegations in Author Lawsuits

OpenAI, the company behind ChatGPT, has responded to copyright infringement lawsuits filed by authors, denying allegations and asserting fair use. The case highlights the ongoing debate surrounding AI and intellectual property rights.
