Microsoft Removes AI Training Guide Using Pirated Books

Microsoft Removes Controversial AI Training Guide

Microsoft deleted a developer blog post after significant backlash on Hacker News over the guide, which encouraged developers to train large language models on pirated Harry Potter books [1]. The post, authored by senior product manager Pooja Kamath in November 2024, demonstrated how to add generative AI features to applications using Azure SQL DB, LangChain, and LLMs [1]. Kamath, who has been with Microsoft for over a decade, linked to a Kaggle dataset containing all seven Harry Potter novels as an "engaging and relatable example" to showcase the company's new feature [1].

Source: PCWorld

Copyright Infringement Concerns Surface

The Kaggle dataset was incorrectly marked as public domain, even though the Harry Potter series remains firmly protected by copyright held by J.K. Rowling and various entities worldwide [2]. The dataset had been available online for years and had accumulated approximately 10,000 downloads before it was deleted on Thursday, after Ars Technica reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft [1]. Maindola stated that "the dataset was marked as Public Domain by mistake," with no intention to misrepresent the licensing status [1]. A complete collection of the novels sells for $70 in ebook format on Amazon, underscoring that the books are still commercial, copyrighted works and that distributing them for free was unauthorized [2].

The Microsoft Developer Blog Post Details

The Microsoft developer blog post explained how users could train AI models by downloading the Harry Potter dataset and uploading the text files to Azure Blob Storage [1]. The guide suggested two primary use cases: building Q&A systems that provide context-rich answers and generating AI-driven fan fiction to "delight Potterheads" [1]. Microsoft even uploaded an example model to Azure Blob Storage based on the first book, Harry Potter and the Sorcerer's Stone [1]. The blog closed with an LLM-generated image depicting caricatures of Harry Potter and Ron Weasley on a train with a Microsoft logo between them [2].

Source: Ars Technica

Ethical and Legal Challenges in AI Development

Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law's Program in Intellectual Property Law, suggested Kamath may not have realized the books were too recent to be in the public domain [1]. "Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last," Smith explained [1]. The incident underscores the ethical challenges that arise in AI development when copyrighted material is improperly used to train machine learning models [2].

Broader Industry Implications and Ongoing Lawsuits

The blog post appeared at a time when AI firms were facing lawsuits accusing them of infringing copyrights by allegedly training models on pirated materials and regurgitating works verbatim [1]. Authors have filed lawsuits against Meta, OpenAI, Nvidia, Alphabet, Anthropic, Microsoft, and others, seeking to stop training on copyrighted works or to obtain remuneration for books already incorporated into LLM training without permission [2]. Initial court results have been mixed: some courts have found the results of training models "transformative," and thus substantively different from the underlying data under the fair use doctrine, while others maintain that the initial acts of piracy must still be prosecuted [2]. Microsoft declined to comment on the matter, and Kaggle did not respond to requests for comment [1].

References

[1] Microsoft removes guide on how to train LLMs on pirated Harry Potter books

[2] Accio Lawyers! Microsoft manager trained AI on pirated Potter books
