Microsoft Pulls AI Training Guide After Backlash Over Pirated Harry Potter Books

Microsoft removed a developer blog post that instructed users on how to train large language models using pirated Harry Potter books. The post, written by senior product manager Pooja Kamath in November 2024, linked to a Kaggle dataset containing all seven novels incorrectly marked as public domain. The removal highlights growing ethical and legal challenges surrounding AI model training and the use of copyrighted material without permission.

Microsoft Removes Controversial AI Training Guide

Microsoft deleted a developer blog post after a backlash on Hacker News over the guide, which encouraged developers to train large language models using pirated Harry Potter books [1]. The post, authored by senior product manager Pooja Kamath in November 2024, demonstrated how to add generative AI features to applications using Azure SQL DB, LangChain, and LLMs [1]. Kamath, who has been with Microsoft for over a decade, linked to a Kaggle dataset containing all seven Harry Potter novels as an "engaging and relatable example" to showcase the company's new feature [1].

Source: PCWorld

Copyright Infringement Concerns Surface

The Kaggle dataset was incorrectly marked as public domain despite the Harry Potter series being firmly protected under copyright held by J.K. Rowling and various entities worldwide [2]. The dataset had been available online for years, accumulating approximately 10,000 downloads before being deleted on Thursday after Ars Technica reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft [1]. Maindola stated that "the dataset was marked as Public Domain by mistake" with no intention to misrepresent the licensing status [1]. A complete collection of the novels costs $70 in ebook format on Amazon, making the free distribution a clear case of copyright infringement [2].

The Microsoft Developer Blog Post Details

The Microsoft developer blog post explained how users could train AI models by downloading the Harry Potter dataset and uploading text files to Azure Blob Storage [1]. The guide suggested two primary use cases: building Q&A systems that provide context-rich answers and generating AI-driven fan fiction to "delight Potterheads" [1]. Microsoft even uploaded an example model to Azure Blob Storage based on the first book, Harry Potter and the Sorcerer's Stone [1]. The blog closed with an LLM-generated image depicting caricatures of Harry Potter and Ron Weasley on a train with a Microsoft logo between them [2].

Source: Ars Technica

Ethical and Legal Challenges in AI Development

Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law's Program in Intellectual Property Law, suggested Kamath may not have realized the books were too recent to be in the public domain [1]. "Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last," Smith explained [1]. The incident underscores the ethical challenges that arise in AI development when copyrighted material is used improperly to train machine learning models [2].

Broader Industry Implications and Ongoing Lawsuits

The blog post emerged at a time when AI firms were beginning to face lawsuits accusing them of infringing copyrights by training models on pirated materials and regurgitating works verbatim [1]. Authors have filed lawsuits against Meta, OpenAI, Nvidia, Alphabet, Anthropic, Microsoft, and others, seeking to stop training on copyrighted works or to obtain remuneration for books already incorporated into LLM training without permission [2]. Initial court results have been mixed: some courts have found the results of training models "transformative" under the fair use doctrine, and thus substantively different from the source data, while others maintain that the initial acts of piracy must still be prosecuted [2]. Microsoft declined to comment on the matter, and Kaggle did not respond to requests for comment [1].
