OpenAI Accused of Training GPT-4o on Copyrighted O'Reilly Books Without Permission

A new study by the AI Disclosures Project suggests that OpenAI may have used paywalled O'Reilly Media books to train its GPT-4o model without proper licensing, raising concerns about copyright infringement and the need for transparency in AI training data sources.

AI Watchdog Accuses OpenAI of Copyright Infringement

A new study by the AI Disclosures Project, a nonprofit co-founded by Tim O'Reilly and Ilan Strauss, has accused OpenAI of training its GPT-4o model on copyrighted O'Reilly Media books without permission 1. The research, which used a method called DE-COP, suggests that GPT-4o recognizes paywalled O'Reilly book content far more strongly than OpenAI's earlier models do 1.

Study Methodology and Findings

The researchers used 13,962 paragraph excerpts from 34 O'Reilly books to probe GPT-4o, GPT-3.5 Turbo, and other OpenAI models 1. The study found that GPT-4o "recognized" far more paywalled O'Reilly book content than older models, even after accounting for potential confounding factors 1.
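To make the idea of a model "recognizing" a passage more concrete, here is a minimal Python sketch of a multiple-choice membership test in the spirit of DE-COP: the model sees a verbatim excerpt mixed in with paraphrases and is asked to pick the original. This is an illustration under stated assumptions, not the study's actual pipeline. The names `build_quiz`, `guess_rate`, and the `choose` callback (a stand-in for a call to a language model) are hypothetical, and the scoring shown is a plain accuracy figure rather than the statistics the researchers report.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One test item: a verbatim excerpt plus paraphrases of it (hypothetical format).
Item = Tuple[str, Sequence[str]]


def build_quiz(verbatim: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim excerpt among its paraphrases.

    Returns the shuffled options and the index of the verbatim excerpt.
    """
    options = [verbatim, *paraphrases]
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(items: Sequence[Item],
               choose: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt.

    `choose` is an assumed stand-in for the model under test: it receives
    the shuffled options and returns the index it believes is the original.
    """
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in items:
        options, answer = build_quiz(verbatim, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(items)
```

With one verbatim excerpt and three paraphrases per quiz, a model that has never seen the text would be expected to score near 25 percent; a rate well above that on paywalled excerpts is the kind of signal such a test looks for, once confounding factors are accounted for.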

Implications and Industry Trends

This accusation comes amid ongoing debates about AI companies' use of copyrighted material for training purposes. OpenAI has been advocating for looser restrictions on developing models using copyrighted data 2. The company has some content licensing deals in place but faces several lawsuits over its training data practices 1.

Broader Copyright Concerns in AI Training

A separate study by researchers from the University of Washington, the University of Copenhagen, and Stanford proposed a new method for identifying training data "memorized" by models 2. This study suggested that GPT-4 showed signs of having memorized portions of popular fiction books and New York Times articles 2.

Industry Response and Future Implications

The findings highlight the need for increased transparency regarding pre-training data sources and the development of formal licensing frameworks for AI content training 3. There are concerns that failure to adequately compensate creators could lead to a decline in internet content quality and diversity 3.

OpenAI's Position and Industry Trends

OpenAI has been seeking higher-quality training data and has hired experts in various domains to fine-tune its models' outputs 1. The company has also urged the US government to relax copyright restrictions to facilitate AI model training 3.

As the AI industry grapples with these issues, some companies are introducing measures to protect copyrighted material. For instance, Cloudflare has developed an AI-powered system designed to deter unauthorized web scraping 3.

TheOutpost.ai
© 2025 Triveous Technologies Private Limited