Curated by THEOUTPOST
On Sat, 19 Oct, 12:08 AM UTC
6 Sources
[1]
Penguin Random House is adding an AI warning to its books' copyright pages
Penguin Random House, the world's largest trade publisher, will be adding language to the copyright pages of its books to prohibit the use of those books to train AI. The Bookseller reports that new books and reprints of older titles from the publisher will now include the statement, "No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems." While the use of copyrighted material to train AI models is currently being fought over in multiple lawsuits, Penguin Random House appears to be the first major publisher to update its copyright pages to reflect these new concerns. On its own, this update doesn't necessarily change the legal status of a text (i.e., it's not implying that using a book for AI training was totally fine until now). Nor does it necessarily mean Penguin Random House is completely opposed to the use of AI in book publishing. In August, the publisher outlined an initial approach to generative AI, saying it will "vigorously defend the intellectual property that belongs to our authors and artists" while also promising to "use generative AI tools selectively and responsibly, where we see a clear case that they can advance our goals."
[2]
Penguin Random House copyright pages will now forbid AI training
Credit: Idrees Abbas/SOPA Images/LightRocket via Getty Images
Penguin Random House (PRH), the largest of the Big Five publishers, is pushing back against its published works being used to train AI. As first reported by The Bookseller, PRH has changed its copyright wording to target AI. The new rules state that "no part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems." This statement will appear in all new titles across PRH's imprints, as well as in reprints of backlist titles. PRH's change to its copyright wording makes it the first of the Big Five publishers to take such an action against AI training, at least publicly. Mashable has reached out to the remaining Big Five trade publishers -- Hachette, HarperCollins, Macmillan, and Simon & Schuster -- for comment. PRH's move is the latest in a series of copyright actions by publishers against AI scraping. In late 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, and in October 2024 the Times also sent a cease and desist letter to the Jeff Bezos-backed AI startup Perplexity. And with companies allowing seemingly anything to be used for AI training, from X posts to LinkedIn data, who can blame them?
[3]
Major Publisher Penguin Random House Blocks AI Training on Its Books
Penguin Random House, one of the world's largest publishers, has taken action to block firms from training AI systems on its huge portfolio, publishing trade magazine The Bookseller reports. AI firms often trawl or "scrape" sources like fiction and non-fiction books, newspapers, and social media to train their AI models, which has already caused plenty of legal controversy. Alongside Simon & Schuster, Hachette, HarperCollins, and Macmillan Publishers, Penguin Random House is considered one of the 'Big Five' English-language publishers, which were thought to control 80% of the U.S. book trade as of 2022. Penguin has amended the copyright wording that appears on all its titles worldwide, across all its imprints. It now reads: "No part of this book may be used or reproduced in any manner to train artificial intelligence technologies or systems". According to The Bookseller, the new wording will appear on all its new titles and any reprinted older titles. The statement also invokes a European Parliament directive, which gives copyright holders the right to have their material protected from text and data mining by AI firms, provided they opt out of such use. Book publishing is not the only industry pushing back against AI firms allegedly profiting off its material. In December 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, claiming that millions of its articles were used to train the companies' AI models. However, not all of the world's largest book publishers are taking such a hardline approach to how their material is used. Wiley, Oxford University Press, and Taylor & Francis have all signed agreements that allow their content to be used to train AI, under certain conditions, according to The Bookseller earlier this year.
In a statement to the trade publication, copyright lawyer Chien-Wei Lui, a senior associate at Fox Williams LLP, broadly supported the recent change to the copyright warning. "The more training that is being done on a non-contractual/licence basis, the greater the risk that author content is being devalued," she said. "Why would a platform pay to license content for training purposes if it suspects that content is already 'out there'?" The lawyer added: "Publishers need to ensure they understand all the tools at their disposal to limit the ability for third parties to use their content for training purposes. Having a clear and advertised statement about reserving all training and text and data mining rights, for example, is helpful."
[4]
Penguin Random House amends its copyright rules to protect authors from AI
The new warning prohibits AI companies from training on the publisher's books. Artificial intelligence makers have faced a mountain of criticism for borrowing from the work of others to train their models. Now the world's largest publishing house is taking steps to ensure its authors don't have their work plagiarized in the name of progress. The Bookseller reports that Penguin Random House changed the copyright page at the front of its books to address the use of any of its titles as a source for AI training. Now the wording states: "No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems." The new wording also protects against data absorption by noting that the publisher "expressly reserves [the titles] from the text and data mining exception." This part of the amended text draws on a recent EU directive regarding text and data mining exceptions and ownership. Penguin Random House is the latest publishing company to take action against encroaching AI models. Earlier this week, The New York Times issued a cease and desist letter to the AI startup Perplexity to stop it from using the Times' articles and stories to help its AI model create answers for users.
[5]
Penguin Adds a Do-Not-Scrape-for-AI Page to Its Books
The world's largest publishing house doesn't want its copyrighted work used to train AI models. Taking a firm stance against tech companies' unlicensed use of its authors' works, the publishing giant Penguin Random House will change the language on all of its books' copyright pages to expressly prohibit their use in training artificial intelligence systems, according to reporting by The Bookseller. It's a notable departure from other large publishers, such as academic printers Taylor & Francis, Wiley, and Oxford University Press, which have all agreed to license their portfolios to AI companies. Matthew Sag, an AI and copyright expert at Emory University School of Law, said Penguin Random House's new language appears to be directed at the European Union market but could also impact how AI companies in the U.S. use its material. Under EU law, copyright holders can opt out of having their work data mined. While that right isn't enshrined in U.S. law, the largest AI developers generally don't scrape content behind paywalls or content excluded by sites' robots.txt files. "You would think there is no reason they should not respect this kind of opt out [that Penguin Random House is including in its books] so long as it is a signal they can process at scale," Sag said. Dozens of authors and media companies have filed lawsuits in the U.S. against Google, Meta, Microsoft, OpenAI, and other AI developers accusing them of violating the law by training large language models on copyrighted work. The tech companies argue that their actions fall under the fair use doctrine, which allows for the unlicensed use of copyrighted material in certain circumstances: for example, if the derivative work substantially transforms the original content, or if it's used for criticism, news reporting, or education. U.S. courts haven't yet decided whether feeding a book into a large language model constitutes fair use.
Meanwhile, social media trends in which users post messages telling tech platforms not to train AI models on their content have been predictably unsuccessful. Penguin Random House's no-training message is a bit different from those optimistic copypastas. For one thing, social media users have to agree to a platform's terms of service, which invariably allows their content to be used to train AI. For another, Penguin Random House is a wealthy international publisher that can back up its message with teams of lawyers. The Bookseller reported that the publisher's new copyright pages will read, in part: "No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems. In accordance with Article 4(3) of the Digital Single Market Directive 2019/790, Penguin Random House expressly reserves this work from the text and data mining exception." Tech companies are happy to mine the internet, particularly sites like Reddit, for language datasets, but the quality of that content tends to be poor: full of bad advice, racism, sexism, and all the other isms, contributing to bias and inaccuracies in the resulting models. AI researchers have said that books are among the most desirable training data for models due to the quality of writing and fact-checking. If Penguin Random House can successfully wall off its copyrighted content from large language models, it could have a significant impact on the generative AI industry, forcing developers to either start paying for high-quality content (which would be a blow to business models reliant on using other people's work for free) or try to sell customers on models trained on low-quality internet content and outdated published material.
"The endgame for companies like Penguin Random House opting out of AI training may be to satisfy the interests of authors who are opposed to their works being used as training data for any reason, but it is probably so that the publishing company can turn around and [start] charging license fees for access to training data," Sag said. "If this is the world we end up in, AI companies will continue to train on the 'open Internet' but anyone who is in control of a moderately large pile of text will want to opt out and charge for access. That seems like a pretty good compromise which lets publishers and websites monetize access without creating impossible transaction costs for AI training in general."
[6]
Penguin Random House books now explicitly say 'no' to AI training
What gets printed on that page might be a warning shot, but it also has little to do with actual copyright law. The amended page is sort of like Penguin Random House's version of a robots.txt file, which websites will sometimes use to ask AI companies and others not to scrape their content. But robots.txt isn't a legal mechanism; it's a voluntarily adopted norm across the web. Copyright protections exist regardless of whether the copyright page is slipped into the front of the book, and fair use and other defenses (if applicable!) also exist even if the rights holder says they do not.
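For readers unfamiliar with the analogy: a robots.txt file is a plain-text file at a site's root that asks specific crawlers to stay away. A minimal sketch of an AI-crawler opt-out is below; the user-agent names shown (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training) are the publicly documented ones, though exactly which crawlers honor which entries varies.

```text
# robots.txt — a request, not a legal mechanism, much like PRH's new copyright page.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search-engine crawlers remain welcome.
User-agent: *
Allow: /
```

Compliance is voluntary: a crawler that ignores the file faces no technical barrier, which is exactly the limitation the paragraph above notes for the copyright-page wording.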
Penguin Random House, the world's largest trade publisher, has updated its copyright pages to prohibit the use of its books for training AI systems, marking a significant move in the ongoing debate over AI and copyright.
Penguin Random House (PRH), the world's largest trade publisher, has taken a significant step to protect its authors' intellectual property by adding new language to the copyright pages of its books. The updated text explicitly prohibits the use of its publications for training artificial intelligence technologies or systems 1.
The amended copyright statement now reads: "No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems" 2. This change will be implemented across all PRH imprints, affecting both new titles and reprints of older works.
While this update doesn't necessarily alter the legal status of the texts, it represents a clear stance against the unauthorized use of copyrighted material in AI development. PRH is the first among the "Big Five" English language publishers to take such a public action 3.
The move aligns with a European Parliament directive that grants copyright holders the right to protect their material from text or data mining by AI firms, provided they opt out of such usage 3. This could have far-reaching implications for AI companies that rely on vast amounts of text data for training their models.
Copyright lawyer Chien-Wei Lui supports PRH's decision, stating that it helps preserve the value of author content and encourages proper licensing practices 3. However, not all publishers are taking the same approach. Some, like Wiley, Oxford University Press, and Taylor & Francis, have signed agreements allowing their content to be used for AI training under certain conditions 3.
This development is part of a larger trend of content creators and publishers pushing back against the use of their work in AI training. In late 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, claiming that millions of its articles were used to train AI models without permission 4.
PRH's stance could significantly impact the AI industry. Books are considered high-quality training data due to their well-written and fact-checked content. If other major publishers follow suit, AI companies may be forced to either pay for access to quality content or rely on potentially lower-quality, freely available internet data 5.
Matthew Sag, an AI and copyright expert, suggests that this move by PRH could lead to a new model where publishers and websites monetize access to their content for AI training purposes. This compromise could allow AI companies to continue training on the "open Internet" while enabling content owners to benefit from the use of their work in AI development 5.
Reference
HarperCollins has reached an agreement with an unnamed AI company to use select nonfiction books for AI model training, offering authors $2,500 per book. The deal highlights growing tensions between publishers, authors, and AI firms over copyright and compensation.
7 Sources
Meta is embroiled in a lawsuit alleging the company used pirated books to train its AI models, including Llama. Internal communications reveal ethical concerns and attempts to conceal the practice.
11 Sources
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources
Microsoft has entered into a licensing agreement with HarperCollins to use nonfiction books for training an unreleased AI model, aiming to improve model quality and performance without generating AI-written books.
6 Sources
A group of authors has filed a lawsuit against AI company Anthropic, alleging copyright infringement in the training of their AI chatbot Claude. The case highlights growing concerns over AI's use of copyrighted material.
14 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved