2 Sources
[1]
Microsoft removes guide on how to train LLMs on pirated Harry Potter books
Following backlash in a Hacker News thread, Microsoft deleted a blog post that critics said encouraged developers to pirate Harry Potter books to train AI models that could then be used to create AI slop. The blog, which is archived here, was written in November 2024 by a senior product manager, Pooja Kamath. According to her LinkedIn, Kamath has been at Microsoft for more than a decade and remains with the company. In 2024, Microsoft tapped her to promote a new feature that the blog said made it easier to "add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs." What better way to show "engaging and relatable examples" of Microsoft's new feature that would "resonate with a wide audience" than to "use a well-known dataset" like Harry Potter books, the blog said. The books are "one of the most famous and cherished series in literary history," the blog noted, and fans could use the LLMs they trained in two fun ways: building Q&A systems providing "context-rich answers" and generating "new AI-driven Harry Potter fan fiction" that's "sure to delight Potterheads." To help Microsoft customers achieve this vision, the blog linked to a Kaggle dataset that included all seven Harry Potter books, which, Ars verified, has been available online for years and incorrectly marked as "public domain." Kaggle's terms say that rights holders can send notices of infringing content and repeated offenders risk suspensions, but Hacker News commenters speculated that the Harry Potter dataset flew under the radar, with only 10,000 downloads over time not catching the attention of J.K. Rowling, who famously keeps a strong grip on the Harry Potter copyrights. The dataset was promptly deleted on Thursday after Ars reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft. Maindola told Ars that "the dataset was marked as Public Domain by mistake. 
There was no intention to misrepresent the licensing status of the works." It's unclear whether Kamath was directed to link to the Harry Potter books dataset in the blog, or it was an individual choice. Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law's Program in Intellectual Property Law, told Ars that Kamath may not have realized the books were too recent to be in the public domain. "Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last," Smith said. "Especially if she saw that something was marked by another reputable company as being public domain." Microsoft declined Ars' request to comment. Kaggle did not respond to Ars' request to comment.
Microsoft pulling blog was "probably smart"
On Hacker News, commenters suggested that it's unlikely anyone familiar with the popular franchise would believe the Harry Potter books were in the public domain. They debated whether Microsoft's blog was "problematic copyright-wise," since Microsoft not only encouraged customers to download the infringing materials but also used the books themselves to create Harry Potter AI models that relied on beloved characters to hype Microsoft products. Microsoft's blog was posted more than a year ago, at a time when AI firms began facing lawsuits over AI models accused of infringing copyrights by allegedly training on pirated materials and regurgitating works verbatim. The blog recommended that users learn to train their own AI models by downloading the Harry Potter dataset and then uploading text files to Azure Blob Storage. It included example models based on a dataset that Microsoft seemingly uploaded to Azure Blob Storage, which only included the first book, Harry Potter and the Sorcerer's Stone. Training large language models (LLMs) on text files, Harry Potter fans could create Q&A systems capable of pulling up relevant excerpts of books.
An example query offered was "Wizarding World snacks," which retrieved an excerpt from The Sorcerer's Stone where Harry marvels at strange treats like Bertie Bott's Every Flavor Beans and chocolate frogs. Another prompt asking "How did Harry feel when he first learnt that he was a Wizard?" generated an output pointing to various early excerpts in the book.
Example from Microsoft's blog of a Q&A system output.
But perhaps an even more exciting use case, Kamath suggested, was generating fan fiction to "explore new adventures" and "even create alternate endings." That model could quickly comb the dataset for "contextually similar" excerpts that could be used to output fresh stories that fit with existing narratives and incorporate "elements from the retrieved passages," the blog said. As an example, Kamath trained a model to write a Harry Potter story she could use to market the feature she was blogging about. She asked the model to write a story in which Harry meets a new friend on the Hogwarts Express train who tells him all about Microsoft's Native Vector Support in SQL "in the Muggle world." Drawing on parts of The Sorcerer's Stone where Harry learns about Quidditch and gets to know Hermione Granger, the fan fiction showed a boy selling Harry on Microsoft's "amazing" new feature. To do this, he likened it to having a spell that helps you find exactly what you need among thousands of options, instantly, while declaring it was perfect for machine learning, AI, and recommendation systems. Further blurring the lines between Microsoft and Harry Potter brands, Kamath also generated an image showing Harry with his new friend, stamped with a Microsoft logo.
Smith told Ars that both use cases could frustrate rights holders, depending on the content in the model outputs. "I think that the regurgitation and the creation of fan fiction, they both could flag copyright issues in that fan fiction often has to take from the expressive elements, a copyrighted character, a character that's famous enough to be protected by a copyright law or plot stories or sequences," Smith said. "If these things are copied and reproduced, then that output could be potentially infringing." But it's also still a gray area. Looking at the blog, Smith said, "I would be concerned," but "I wouldn't say it's automatically infringement." Smith told Ars that Microsoft pulling the blog "was probably smart" since courts have only generally said that AI training on copyrighted books is fair use. But courts continue to probe questions about pirated AI training materials. On the deleted Kaggle dataset page, Maindola previously explained that to source the data, he "downloaded the ebooks and then converted them to txt files."
Microsoft may have infringed copyrights
If Microsoft ever faced questions over whether the company knowingly used pirated books to train the example models, fair use "could be a difficult argument," Smith said. Hacker News commenters suggested the blog could be considered fair use, since the training guide was for "educational purposes," and Smith said that Microsoft could raise some "good arguments" in its defense. However, she also suggested that Microsoft could be deemed liable in some ways for contributing to infringement on some level after leaving the blog up for a year. Before it was removed, the Kaggle dataset was downloaded more than 10,000 times. "The ultimate result is to create something infringing by saying, 'Hey, here you go, go grab that infringing stuff and use that in our system,'" Smith said.
"They could potentially have some sort of secondary contributory liability for copyright infringement, downloading it, as well as then using it to encourage others to use it for training purposes." On Hacker News, commenters slammed the blog, including a self-described former Microsoft employee who claimed that Microsoft lets employees "blog without having to go through some approval or editing process." "It looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed," the former employee said. Others suggested the blame was solely with the Kaggle uploader, Maindola, who told Ars that the dataset should never have been marked "public domain." But Microsoft critics pushed back, noting that the Kaggle page made it clear that no special permission was granted and Microsoft's employee should have known better. "They don't need to know any details to know that these properties belong to massive companies and aren't free for the taking," one commenter said. The Harry Potter books weren't the only books targeted, the thread noted, linking to a separate Azure sample containing Isaac Asimov's Foundation series, which is also not in the public domain. "Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels," another Hacker News commenter wrote. "Instead, they opted to use copywritten works that J.K. hasn't released into the public domain (unless user 'Shubham Maindola' is J.K.'s alter ego)." Smith suggested Microsoft could have avoided this week's backlash by more carefully reviewing blogs, noting that "if a company is risk averse, this would probably be flagged." But she also understood Kamath's preference for Harry Potter over the many long-forgotten characters that exist in the public domain. 
On Hacker News, some commenters defended Kamath's blog, urging that it should be considered fair use since nonprofits and educational institutions could do the same thing in a teaching context without issue. "I would have been concerned if I were the one clearing this for Microsoft, but at the same time, I completely understand what this employee was doing," Smith said. "No one wants to write fan fiction about books that are in the public domain."
[2]
Accio Lawyers! Microsoft manager trained AI on pirated Potter books
This case underscores significant ethical challenges in AI development when copyrighted material is improperly used for machine learning training purposes. Oh, my. With "AI" systems causing a lot of problems pretty much everywhere, it's a bad look for one of the world's most important tech companies to actively promote piracy. But that appears to be just what happened, with a post hosted on Microsoft's developer blog, actively using an apparently pirated set of Harry Potter novels to train an Azure-based "AI" system. "The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort," wrote Pooja Kamath, a Microsoft Senior Product Manager. The blog post then pointed to a Kaggle dataset link that contained seven TXT files, apparently encompassing the entire published novel series. The blog post was a guide on adding generative "AI" to applications via Azure. The manager said that it could be used to create a Q&A system, or auto-generate Harry Potter fan fiction. "This feature is sure to delight Potterheads, allowing them to explore new adventures and create their own magical stories." It closes with an LLM-generated image of two children on a train, obviously caricatures of Harry Potter and Ron Weasley, with a Microsoft logo between them. This is, in technical legalistic terms, a big frickin' no-no. All the Harry Potter novels are, of course, held under copyright by various entities around the world, including the author. A quick browse on Amazon shows that a complete collection costs $70 USD in ebook format at the time of writing. Hosting or downloading the files for free without paying any kind of royalty is a crime basically everywhere. Yes, that includes downloading it even if all you intend to do is plug it into a large language model. 
The original Microsoft how-to post was published in late 2024, and has been removed from the site (though it's still accessible via the Internet Archive). Ditto for the Kaggle dataset, which was mistakenly marked as "public domain" and only downloaded about 10,000 times, according to a report from Ars Technica. Both the blog post and the pirated data set seem to have flown under the radar for a year and a half, until a Hacker News thread yesterday brought new attention to them. It's shocking that a Microsoft manager would be so casual about ebook piracy in a public post on a Microsoft blog (though Kamath may not have understood how the public domain system works and may have assumed the files were marked correctly). But the most popular large language models have been trained on millions of ebooks, many (possibly even a majority) of which have been downloaded via illegal piracy. Authors have filed lawsuits against Meta/Facebook, OpenAI, Nvidia, Alphabet/Google, Anthropic, Microsoft, and others, aiming to stop training on copyrighted works and/or seek remuneration for books already incorporated into LLM training without permission. Initial results in the courts have been mixed, sometimes finding the results of training models "transformative" and thus substantively different from the core data, i.e., fair use, and some finding that initial acts of piracy must still be prosecuted.
Microsoft removed a developer blog post that instructed users on how to train large language models using pirated Harry Potter books. The post, written by senior product manager Pooja Kamath in November 2024, linked to a Kaggle dataset containing all seven novels incorrectly marked as public domain. The removal highlights growing ethical and legal challenges surrounding AI model training and the use of copyrighted material without permission.
Microsoft deleted a developer blog post after backlash in a Hacker News thread, where critics said the guide encouraged developers to train large language models on pirated Harry Potter books [1]. The post, authored by senior product manager Pooja Kamath in November 2024, demonstrated how to add generative AI features to applications using Azure SQL DB, LangChain, and LLMs [1]. Kamath, who has been with Microsoft for more than a decade, linked to a Kaggle dataset containing all seven Harry Potter novels, chosen as a "well-known dataset" that would make for "engaging and relatable examples" of the company's new feature [1].
Source: PCWorld
The Kaggle dataset was incorrectly marked as public domain, even though the Harry Potter series remains under copyright held by J.K. Rowling and various entities worldwide [2]. The dataset had been available online for years and had been downloaded roughly 10,000 times before it was deleted on Thursday, after Ars Technica reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft [1]. Maindola stated that "the dataset was marked as Public Domain by mistake" and that there was no intention to misrepresent the licensing status of the works [1]. A complete collection of the novels cost $70 in ebook format on Amazon at the time of writing, and hosting or downloading the files for free without paying royalties is illegal almost everywhere, according to PCWorld [2].
The Microsoft developer blog post explained how users could train AI models by downloading the Harry Potter dataset and uploading text files to Azure Blob Storage [1]. The guide suggested two primary use cases: building Q&A systems that provide context-rich answers and generating AI-driven fan fiction to "delight Potterheads" [1]. It included example models based on a dataset, seemingly uploaded to Azure Blob Storage by Microsoft, that contained only the first book, Harry Potter and the Sorcerer's Stone [1]. The blog closed with an LLM-generated image depicting caricatures of Harry Potter and Ron Weasley on a train with a Microsoft logo between them [2].
Source: Ars Technica
Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law's Program in Intellectual Property Law, suggested Kamath may not have realized the books were too recent to be in the public domain [1]. "Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last," Smith explained [1]. The incident underscores the ethical challenges that arise when copyrighted material is improperly used to train machine learning models [2].
The blog post was published at a time when AI firms began facing lawsuits over AI models accused of infringing copyrights by allegedly training on pirated materials and regurgitating works verbatim [1]. Authors have filed lawsuits against Meta, OpenAI, Nvidia, Alphabet, Anthropic, Microsoft, and others, seeking to stop training on copyrighted works or obtain remuneration for books already incorporated into LLM training without permission [2]. Initial court results have been mixed: some rulings found the results of training models "transformative," and thus fair use because they differ substantively from the underlying data, while others held that initial acts of piracy must still be prosecuted [2]. Microsoft declined to comment on the matter, and Kaggle did not respond to requests for comment [1].
Summarized by Navi