Curated by THEOUTPOST
On Wed, 2 Apr, 4:02 PM UTC
6 Sources
[1]
Researchers suggest OpenAI trained AI models on paywalled O'Reilly books | TechCrunch
OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn't license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data -- books, movies, TV shows, and so on -- they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" Ghibli-style images, it's simply pulling from its vast knowledge to approximate. It isn't arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, like worsening a model's performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.) In ChatGPT, GPT-4o is the default model. O'Reilly Media doesn't have a licensing agreement with OpenAI, the paper says.

"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content [...] compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the co-authors of the paper. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, it suggests that the model might have prior knowledge of the text from its training data. (A minimal sketch of such a probe follows this article.)

The co-authors of the paper -- O'Reilly, Strauss, and AI researcher Sruly Rosenblat -- say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.

According to the results of the paper, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That's even after accounting for potential confounding factors, the authors said, like improvements in newer models' ability to figure out whether text was human-authored. "GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors.

It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might've collected the paywalled book excerpts from users copying and pasting them into ChatGPT. Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1.
It's possible that these models weren't trained on paywalled O'Reilly book data, or were trained on a lesser amount than GPT-4o.

That being said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. That's a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms -- albeit imperfect ones -- that allow copyright owners to flag content they'd prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.
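To make the DE-COP probe described in the article above concrete, here is a minimal sketch of how such a membership-inference test might be run. It assumes the official `openai` Python client; the prompt wording, model name, and pass/fail scoring are illustrative stand-ins rather than the paper's exact protocol.

```python
# Illustrative DE-COP-style membership-inference probe (not the paper's code).
# Requires the official `openai` package and an OPENAI_API_KEY in the environment.
import random
from openai import OpenAI

client = OpenAI()

def probe_excerpt(verbatim: str, paraphrases: list[str], model: str = "gpt-4o") -> bool:
    """Ask the model to pick the verbatim passage out of four options.

    A correct pick is treated as weak evidence that the passage appeared
    in the model's training data.
    """
    options = paraphrases[:3] + [verbatim]
    random.shuffle(options)
    answer_key = "ABCD"[options.index(verbatim)]

    lettered = "\n\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    prompt = (
        "Exactly one of the following passages is quoted verbatim from a "
        "published book; the others are paraphrases. Reply with only the "
        "letter of the verbatim passage.\n\n" + lettered
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith(answer_key)
```

Running a probe like this over thousands of excerpts, and comparing guess rates on books a model could have seen against books published after its training cutoff, is the essence of the approach the paper describes.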
[2]
OpenAI's models 'memorized' copyrighted content, new study suggests | TechCrunch
A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.

OpenAI is embroiled in suits brought by authors, programmers, and other rights-holders who accuse the company of using their works -- books, codebases, and so on -- to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there isn't a carve-out in U.S. copyright law for training data.

The study, which was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models behind an API, like OpenAI's.

Models are prediction engines. Trained on a lot of data, they learn patterns -- that's how they're able to generate essays, photos, and more. Most of the outputs aren't verbatim copies of the training data, but owing to the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.

The study's method relies on words that the co-authors call "high-surprisal" -- that is, words that stand out as uncommon in the context of a larger body of work. For example, the word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it's statistically less likely than words such as "engine" or "radio" to appear before "humming."

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If the models managed to guess correctly, it's likely they memorized the snippet during training, concluded the co-authors.

According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset containing samples of copyrighted ebooks called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models might have been trained on.

"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."

OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that allow copyright owners to flag content they'd prefer the company not use for training purposes, it has lobbied several governments to codify "fair use" rules around AI training approaches.
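As a rough illustration of the masking idea, the sketch below hides one high-surprisal word and asks a model to restore it. It assumes the official `openai` Python client; picking the masked word by hand stands in for the authors' statistical surprisal measure, and the model name is only an example.

```python
# Illustrative masked-word memorization probe (not the study's code).
# Requires the official `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def restores_masked_word(snippet: str, target: str, model: str = "gpt-4") -> bool:
    """Mask one high-surprisal word and check whether the model restores it.

    A correct guess on a statistically unusual word is read as a sign the
    model may have memorized the snippet during training.
    """
    masked = snippet.replace(target, "[MASK]", 1)
    prompt = (
        "Fill in the single word hidden by [MASK] in this passage. "
        "Reply with only that word.\n\n" + masked
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0,
    )
    guess = response.choices[0].message.content.strip().strip(".").lower()
    return guess == target.lower()

# The article's example: "radar" is an unlikely word to precede "humming".
print(restores_masked_word(
    "Jack and I sat perfectly still with the radar humming.", "radar"
))
```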
[3]
Study suggests OpenAI isn't waiting for copyright exemption
GPT-4o likely trained on O'Reilly books without permission, figures appear to show

Tech textbook tycoon Tim O'Reilly claims OpenAI mined his publishing house's copyright-protected tomes for training data and fed it all into its top-tier GPT-4o model without permission. This comes as the generative AI upstart faces lawsuits over its use of copyrighted material, allegedly without due consent or compensation, to train its GPT-family of neural networks. OpenAI denies any wrongdoing.

O'Reilly (the man) is one of three authors of a study [PDF] titled "Beyond Public Access in LLM Pre-Training Data: Non-public book content in OpenAI's Models," issued by the AI Disclosures Project. By non-public, the authors mean books that are available to humans from behind a paywall, and aren't publicly available to read for free unless you count sites that illegally pirate this kind of material.

The trio set out to determine whether GPT-4o had, without the publisher's permission, ingested 34 copyrighted O'Reilly Media books. To probe the model, which powers the world-famous ChatGPT, they performed so-called DE-COP inference attacks described in a 2024 preprint paper.

Here's how that worked: The team posed OpenAI's model a string of multiple-choice questions. Each question asked the software to select, from a group of paragraphs labeled A to D, the one that is a verbatim passage of text from a given O'Reilly (the publisher) book. One of the options was lifted straight from the book; the others were machine-generated paraphrases of the original.

If the OpenAI model tended to answer correctly and identify the verbatim paragraphs, that suggested it was probably trained on that copyrighted text. More specifically, the model's choices were used to calculate what's dubbed an Area Under the Receiver Operating Characteristic (AUROC) score, with higher figures indicating a greater likelihood the neural network was trained on passages from the 34 O'Reilly books. Scores closer to 50 percent, meanwhile, were considered an indication that the model hadn't been trained on the data. (A toy version of this scoring appears after this article.)

Testing of OpenAI models GPT-3.5 Turbo and GPT-4o Mini, as well as GPT-4o, across 13,962 paragraphs uncovered mixed results. GPT-4o, which was released in May 2024, scored 82 percent, a strong signal it was likely trained on the publisher's material. The researchers speculated OpenAI may have trained the model using the LibGen database, which contains all 34 of the books tested. You may recall Meta has also been accused of training its Llama models using this notorious dataset.

The AUROC score for 2022's GPT-3.5 model came in at just above 50 percent. The researchers asserted that the higher score for GPT-4o is evidence that "the role of non-public data in OpenAI's model pre-training data has increased significantly over time."

However, the trio also found that the smaller GPT-4o Mini model, also released in 2024 after a training process that ended at the same time as the full GPT-4o model, wasn't seemingly trained on O'Reilly books. They think that's not an indicator their tests are flawed, but that the smaller parameter count in the mini-model may impact its ability to "remember" text.

"These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training," the authors wrote.
"Although the evidence present here on model access violations is specific to OpenAI and O'Reilly Media books, this is likely a systematic issue," they added. The trio - which included Sruly Rosenblat and Ilan Strauss - also warned that a failure to adequately compensate creators for their works could result in - and if you can pardon the jargon - the enshittification of the entire internet. "If AI companies extract value from a content creator's produced materials without fairly compensating the creator, they risk depleting the very resources upon which their AI systems depend," they argued. "If left unaddressed, uncompensated training data could lead to a downward spiral in the internet's content quality and diversity." Uncompensated training data could lead to a downward spiral in the internet's content quality and diversity AI giants seem to know they can't rely on internet scraping to find the material they need to train models, as they have started signing content licensing agreements with publishers and social networks. Last year, OpenAI inked deals with Reddit and Time Magazine to access their archives for training purposes. Google also did a deal with Reddit. Recently, however, OpenAI has urged the US government to relax copyright restrictions in ways that would make training AI models easier. Last month, the super-lab submitted an open letter to the White House Office of Science and Technology in which it argued that "rigid copyright rules are repressing innovation and investment," and that if action isn't taken to change this, Chinese model builders could surpass American companies. While model-makers apparently struggle, lawyers are doing well. As we recently reported, Thomson Reuters won a partial summary judgment against Ross Intelligence after a US court found the startup had infringed copyright by using the newswire's Westlaw's headnotes to train its AI system. While neural network trainers push for unfettered access, others in the tech world are introducing roadblocks to protect copyrighted material. Last month Cloudflare rolled out a bot-busting AI designed to make life miserable for scrapers that ignore robots.txt directives. Cloudflare's "AI Labyrinth" works by luring rogue crawler bots into a maze of decoy pages, wasting their time and compute resources while shielding real content. OpenAI didn't immediately respond to a request for comment; we'll let you know if we hear anything back. ®
[4]
An AI watchdog accused OpenAI of using copyrighted books without permission
An artificial intelligence watchdog is accusing OpenAI of training its default ChatGPT model on copyrighted book content without permission. In a new paper published this week, the AI Disclosures Project alleges that OpenAI likely trained its GPT-4o model using non-public material from O'Reilly Media.

The researchers used a legally obtained dataset of 34 copyrighted O'Reilly books and found that GPT-4o showed "strong recognition" of the company's paywalled content. By contrast, GPT-3.5 Turbo appeared more familiar with publicly accessible O'Reilly book samples.

"These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training," the authors wrote in the paper. Tim O'Reilly, one of the nonprofit's founders and a co-author of the paper, is the CEO of O'Reilly Media.
[5]
OpenAI might have trained its AI on stolen books
OpenAI is facing accusations of training its AI models on copyrighted material without permission, as a new paper alleges the company used paywalled books from O'Reilly Media to train its GPT-4o model. The AI Disclosures Project, a nonprofit co-founded by Tim O'Reilly and Ilan Strauss, published the paper.

AI models function as prediction engines, learning patterns from extensive data like books and movies to extrapolate from prompts. While some AI labs are using AI-generated data as real-world sources diminish, training on purely synthetic data carries risks, such as degrading a model's performance.

The paper's methodology, DE-COP, tests whether a model can distinguish human-authored texts from AI-generated paraphrases of the same text; if it can, the model likely has prior knowledge of the text from its training data. Researchers probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models, using 13,962 excerpts from 34 O'Reilly books to estimate the probability of inclusion in training datasets.

Results indicated GPT-4o recognized significantly more paywalled O'Reilly book content than older models like GPT-3.5 Turbo. According to the paper, GPT-4o likely recognizes many non-public O'Reilly books published before its training cutoff date. O'Reilly doesn't have a licensing agreement with OpenAI, according to the paper.

The co-authors acknowledge the method isn't foolproof and that OpenAI might have collected excerpts from users' ChatGPT inputs. Another caveat is that more recent OpenAI models, including GPT-4.5, weren't evaluated.

OpenAI, advocating for looser copyright restrictions, has sought higher-quality training data, hiring journalists to fine-tune model outputs. The company also has licensing deals with news publishers and offers opt-out mechanisms for copyright owners. OpenAI has not commented on the paper.
[6]
Researchers Claim OpenAI Trained Its AI Models on Copyrighted Content
GPT-4o was said to show the highest recognition of copyrighted content

OpenAI might have trained its artificial intelligence (AI) models on copyrighted content, according to a research paper. Per a recently published paper from the non-profit organisation AI Disclosures Project, the San Francisco-based AI firm's recent large language models (LLMs) showed a higher recognition of copyrighted content compared to its older models. The researchers used a recently developed method called DE-COP to detect copyrighted content in the AI models' training dataset. Notably, the study found that the GPT-4o mini was not trained on the specific copyrighted content.

The study, titled Beyond Public Access in LLM Pre-Training Data, was conducted to check if OpenAI's AI models were trained on non-public book content. For the study, researchers focused on O'Reilly Media, a US online learning platform, which contains numerous copyrighted books. The founder of the platform, Tim O'Reilly, was also one of the co-authors of the study.

The researchers used the DE-COP method to test whether the training data of the AI models contained copyrighted material. This is a relatively new test, introduced in a paper published in 2024. The method, also known as a membership inference attack, quizzes an AI model with a multiple-choice test to see whether it can identify copyrighted content from machine-generated paraphrased alternatives. The researchers used Claude 3.5 Sonnet to paraphrase the copyrighted material. As many as 13,962 paragraph excerpts from 34 O'Reilly Media books were used for the test.

Based on the tests conducted, the researchers claimed to have found that the GPT-4o AI model showed the highest recognition of the copyrighted and paywalled O'Reilly book content, with an 82 percent Area Under the Receiver Operating Characteristic curve (AUROC) score. Notably, the AUROC score is part of the DE-COP method and is derived from the guess rates on the multiple-choice test.

The study also found that older OpenAI models, such as GPT-3.5 Turbo, showed lesser content recognition compared to GPT-4o, but still high enough to be significant. However, GPT-4o mini was found not to have been trained on the paywalled O'Reilly Media books. The paper states the reason could be that the test is not effective against smaller language models.
A new study by the AI Disclosures Project suggests that OpenAI may have used paywalled O'Reilly Media books to train its GPT-4o model without proper licensing, raising concerns about copyright infringement and the need for transparency in AI training data sources.
A new study by the AI Disclosures Project, a nonprofit co-founded by Tim O'Reilly and Ilan Strauss, has accused OpenAI of training its GPT-4o model on copyrighted O'Reilly Media books without permission [1]. The research, which used a method called DE-COP, suggests that OpenAI's latest model demonstrates strong recognition of paywalled O'Reilly book content compared to earlier models [1].
The researchers used 13,962 paragraph excerpts from 34 O'Reilly books to probe GPT-4o, GPT-3.5 Turbo, and other OpenAI models [1]. The study found that GPT-4o "recognized" far more paywalled O'Reilly book content than older models, even after accounting for potential confounding factors [1].
This accusation comes amid ongoing debates about AI companies' use of copyrighted material for training purposes. OpenAI has been advocating for looser restrictions on developing models using copyrighted data [2]. The company has some content licensing deals in place but faces several lawsuits over its training data practices [1].
A separate study by researchers from the University of Washington, the University of Copenhagen, and Stanford proposed a new method for identifying training data "memorized" by models [2]. This study suggested that GPT-4 showed signs of having memorized portions of popular fiction books and New York Times articles [2].
The findings highlight the need for increased transparency regarding pre-training data sources and the development of formal licensing frameworks for AI content training [3]. There are concerns that failure to adequately compensate creators could lead to a decline in internet content quality and diversity [3].
OpenAI has been seeking higher-quality training data and has hired experts in various domains to fine-tune its models' outputs [1]. The company has also urged the US government to relax copyright restrictions to facilitate AI model training [3].
As the AI industry grapples with these issues, some companies are introducing measures to protect copyrighted material. For instance, Cloudflare has developed an AI-powered system designed to deter unauthorized web scraping [3].