Nvidia accused of seeking pirated books from shadow library to train AI models amid lawsuit

Nvidia faces expanded allegations in a class-action lawsuit claiming the company sought high-speed access to 500 terabytes of pirated books from Anna's Archive for AI training. Internal emails reportedly show management approved the deal despite warnings about illegally obtained data, raising questions about how tech giants source training data for their language models.

Nvidia contacted shadow library for AI training data, lawsuit alleges

Nvidia has been accused of attempting to secure high-speed access to Anna's Archive, a notorious shadow library containing millions of pirated books, to fuel its AI training efforts. According to an amended complaint filed in the U.S. District Court for the Northern District of California, internal emails reveal that the Nvidia data strategy team contacted Anna's Archive to explore using its massive repository for pre-training large language models [1][2]. The lawsuit, which now includes authors Abdi Nazemian, Brian Keene, Stewart O'Nan, Andre Dubus III, and Susan Orlean, significantly expands the scope of copyright infringement claims against the GPU giant [3].

Source: TweakTown

Court documents show management approval despite warnings

The amended complaint cites internal Nvidia communications that appear damning. One email snippet shows an unnamed Nvidia executive writing: "we are exploring including Anna's Archive in pre-training data for our LLMs" and seeking "to get a better understanding of LLM-related work you have done" [2]. More significantly, the complaint alleges that Anna's Archive warned Nvidia that its library content was illegally obtained and maintained, yet "within a week of contacting Anna's Archive, and days after being warned by Anna's Archive of the illegal nature of their collections, Nvidia management gave 'the green light' to proceed with the piracy" [1]. Anna's Archive reportedly charged tens of thousands of dollars for high-speed access to its collections and offered Nvidia approximately 500 terabytes of data, including millions of books typically available through Internet Archive's digital lending system [3][4].

Source: PC Gamer

Competitive pressures and broader copyright infringement allegations

The complaint asserts that "competitive pressures drove NVIDIA to piracy," highlighting the intense race among AI companies to secure vast training datasets [3][4]. Nvidia develops its own AI models, including NeMo, Retro-48B, InstructRetro, and Megatron, which require massive text libraries for training [3]. Beyond Anna's Archive, Nvidia also faces accusations of using other pirated sources, including LibGen, Sci-Hub, and Z-Library, in addition to the Books3 database [3]. The lawsuit further alleges that Nvidia distributed scripts and tools enabling corporate customers to download The Pile, which contains the Books3 pirated dataset, introducing new claims of vicarious infringement and contributory infringement [3].

Source: Tom's Hardware

Industry-wide pattern of training AI with pirated data

Nvidia isn't alone in facing scrutiny over data sourcing practices. Meta and Anthropic have previously been found using pirated content from Books3 for their language models [1][2]. However, this marks the first public disclosure of correspondence between a major U.S. tech company and Anna's Archive [3]. In a previous case against Anthropic, a judge ruled that while training on the books did fall under fair use, "Anthropic had no entitlement to use pirated copies for its central library" [2]. Nvidia has defended its actions under fair use, arguing that "training measures statistical correlations in the aggregate, across a vast body of data" and that, to its AI models, books amount to nothing more than statistical correlations [2][3].

What this means for intellectual property rights and AI development

The evidence presented during the discovery phase of this class-action lawsuit raises critical questions about how AI companies balance innovation with intellectual property rights. These court documents appear to show that super-wealthy firms jealously guard their own technologies while showing little regard for the intellectual property of others [1]. The authors seek compensation for damages, and hundreds of other authors whose work appears in the massive pirate library may later join the class-action lawsuit [1][3]. The complaint does not explicitly state whether Nvidia ultimately paid for access to the dataset or whether the deal was completed [3][4]. As this case progresses, it may set precedents for how courts interpret fair use in the context of AI training and whether companies can legally use illegally obtained data for commercial AI development.

TheOutpost.ai

© 2026 Triveous Technologies Private Limited