4 Sources
[1]
Nvidia accused of trying to cut a deal with Anna's Archive for high‑speed access to the massive pirated book haul -- allegedly chased stolen data to fuel its LLMs
Court documents appear to show Nvidia management green lit the deal, despite Anna's Archive's warnings. Nvidia has been accused of offering to pay for 'high-speed access' to Anna's Archive, a notorious 'shadow library' portal, bursting with copyright-infringing materials. Documents published by TorrentFreak appear to show the Nvidia Data Strategy Team reaching out regarding payments for 'high-speed access' to Anna's Archive. Moreover, if the documents are genuine, they indicate that green team management approved the payment plan "within a week." Nvidia, like other AI industry giants, is very interested in gaining access to the largest sources of human knowledge to improve LLM training quality. The likes of Meta and Anthropic have previously been found with their fingers all over pirated content. These super-wealthy firms jealously guard their own technologies, so evidence that they seem to have little or no regard for the intellectual property of others would be a source of irony. TorrentFreak notes that the email snippets it has shared have been precipitated during the discovery phase of an ongoing class action lawsuit where Nvidia is accused of copyright infringement by training its models on content from the Books3 dataset, including copyrighted works taken from pirate site Bibliotik. In that case, Nvidia is defending its actions under 'fair use,' but the new evidence showing Anna's Archive correspondence looks compelling. In fact, the authors behind the Books3 class action have filed an amended complaint significantly expanding the scope of the lawsuit, says TorrentFreak. One of the most damning pieces of correspondence between Nvidia reps and Anna's Archive is shown above. The snippet appears to show an unnamed Nvidia exec inquiring about the use of Anna's Archive for LLM training. 
Probably worse, though, is the section of the new court filing which alleges that "Within a week of contacting Anna's Archive, and days after being warned by Anna's Archive of the illegal nature of their collections, Nvidia management gave 'the green light' to proceed with the piracy." The proposed deal would mean providing Nvidia with high-speed access to ~500TB of data for LLM training. We don't see evidence that the deal actually went through, or that any payments went to Anna's Archive. Nvidia is also accused of giving corporate customers automatic access to datasets such as 'The Pile,' which includes the Books3 pirated collection. The authors behind the class action are looking for compensation for the damages they have suffered. Hundreds of other authors whose work is within the huge pirate library may later join the class action lawsuit. Anna's Archive remains online for now, though its rising profile has pushed it into the inevitable DMCA takedown notice whack‑a‑mole stage. As mentioned in the intro, 'Books3' was also dredged by Meta and Anthropic LLMs. However, this is the first allegation of a formal Anna's Archive business arrangement between a U.S. company and the copyright-infringing books repository. We have reached out to Nvidia for comment on the story.
[2]
Nvidia allegedly greenlit the use of pirated books from illegal sources to train its AI models, according to an expanded class-action lawsuit
The capabilities of AI models, such as GPT-5, Gemini, Claude, and Grok, lie in the size and scope of the dataset used to train them. This has also been the source of multiple lawsuits, claiming that the companies performing the training had no right to freely use the data. In an expanded class-action case against Nvidia, however, the accusation goes one step further, with claims that the GPU giant willingly used an illegal source of pirated books to train its models. As reported by TorrentFreak, an amended complaint (pdf warning) filed at the district court in Oakland, California last week, specifically claims that staff at Nvidia contacted a so-called 'shadow library' known as Anna's Archive, a repository of pirated books and other documents. The plaintiffs cite internal Nvidia communications as evidence, with the filed document purporting to show someone from the data strategy team at Nvidia writing, "we are exploring including Anna's Archive in pre-training data for our LLMs." It continues with "We are figuring out internally whether we are willing to accept the risk of using this data, but would like to speak with your team to get a better understanding of LLM-related work you have done." While Anna's Archive appears not to host any content directly itself, it does act as a 'search engine' for alleged pirate libraries. These third-party hosts aren't exclusively providing access to copyrighted materials, but that content is what they are most infamous for. The original complaint against Nvidia was filed back in 2024, and as Torrent Freak reported at the time, Nvidia's response was essentially to claim that AI training on such material is not the same as owning an illegally obtained book, or even using it as a human does. "Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model," it wrote in response. In essence, Nvidia is saying that the use of such datasets falls under fair use. 
Given that the original complaint involved data garnered from another pirated source (Books3), it's possible that Nvidia may choose to use the same counterargument from 2024. Similar claims have been filed against Anthropic and Meta in the past, and in the case of the former, the court judge ruled that while accessing the data did fall under fair use, "Anthropic had no entitlement to use pirated copies for its central library." How the case against Nvidia will fare, well, we'll just have to wait and see.
[3]
Claim: NVIDIA green-lit pirated book downloads for AI training
NVIDIA executives authorized using millions of pirated books from Anna's Archive for AI training, according to an expanded class-action lawsuit. The suit, citing internal NVIDIA documents, alleges the company contacted Anna's Archive for high-speed access to its data. NVIDIA has benefited from the artificial intelligence boom, with revenue surging due to high demand for its AI-learning chips and data center services. NVIDIA develops its own AI models, including NeMo, Retro-48B, InstructRetro, and Megatron. These models are trained using NVIDIA hardware and large text libraries, similar to practices at other technology companies. The company has faced legal challenges from copyright holders regarding its training methodologies. Authors first sued NVIDIA in early 2024 for copyright infringement, claiming the company's AI models were trained on the Books3 dataset, which included copyrighted works from Bibliotik without permission. NVIDIA defended its actions as fair use, stating that books are statistical correlations to its AI models. However, new evidence emerged during discovery. Plaintiffs filed an amended complaint last Friday, expanding the lawsuit's scope by adding more books, authors, and AI models. The amended complaint includes broader "shadow library" claims. Authors, including Abdi Nazemian, now cite internal NVIDIA emails and documents, alleging the company willingly downloaded millions of copyrighted books. The complaint claims "competitive pressures drove NVIDIA to piracy," involving collaboration with Anna's Archive. According to the amended complaint, a member of NVIDIA's data strategy team contacted Anna's Archive to inquire about acquiring its pirated materials for pre-training large language models, including Anna's Archive. The complaint states Anna's Archive charged tens of thousands of dollars for "high-speed access" to its collections, and NVIDIA sought details on this access. 
The complaint alleges Anna's Archive warned NVIDIA that its library content was illegally acquired and maintained. Anna's Archive reportedly asked NVIDIA executives for internal permission to proceed, which was granted within a week. After receiving permission from NVIDIA management, Anna's Archive provided access to its pirated books. Anna's Archive offered NVIDIA access to approximately 500 terabytes of data, including millions of books typically available through Internet Archive's digital lending system. The complaint does not specify if NVIDIA paid Anna's Archive. NVIDIA also faces accusations of using other pirated sources, including LibGen, Sci-Hub, and Z-Library, in addition to the Books3 database. Authors allege NVIDIA not only downloaded and used pirated books for its AI training but also distributed scripts and tools enabling corporate customers to download "The Pile," which contains the Books3 pirated dataset. These allegations introduce new claims of vicarious and contributory infringement, asserting NVIDIA generated revenue from customers by facilitating access to these pirated datasets. The authors seek compensation for damages for named authors and potentially hundreds of others joining the class-action lawsuit. This revelation marks the first public disclosure of correspondence between a major U.S. tech company and Anna's Archive. The first consolidated and amended complaint, filed at the U.S. District Court for the Northern District of California, names authors Abdi Nazemian, Brian Keene, Stewart O'Nan, Andre Dubus III, and Susan Orlean.
[4]
Lawsuit alleges NVIDIA approved use of pirated books to train AI models
TL;DR: A lawsuit alleges NVIDIA executives approved partnering with Anna's Archive, a site hosting millions of pirated books and papers, to use its data for training Large Language Models. Internal emails reveal NVIDIA sought access to 500 terabytes of illegally obtained content amid competitive pressures. A complaint filed in the US District Court claims NVIDIA executives approved contact with Anna's Archive, a website that harbors millions of copyrighted books and academic papers, to discuss a partnership that involves using Anna's Archive as a dataset for training its Large Language Models (LLMs). The complaint alleges that "competitive pressures drove NVIDIA to piracy," and that internal NVIDIA emails demonstrate a member of the company's data strategy team contacting Anna's Archive about the collaboration. Furthermore, the complaint states that Anna's Archive warned NVIDIA that its treasure trove of data was obtained illegally, and asked how Team Green wanted to proceed. The lawsuit states that within a week, NVIDIA approved of the collaboration, and in response, Anna's Archive offered NVIDIA approximately 500 terabytes of data. "Desperate for books, NVIDIA contacted Anna's Archive -- the largest and most brazen of the remaining shadow libraries -- about acquiring its millions of pirated materials and 'including Anna's Archive in pre-training data for our LLMs,'" the complaint notes. Furthermore, the complaint states that the 500 terabytes of data included millions of books that are only accessible through the Internet Archive's digital lending system. Notably, the complaint does not explicitly state whether NVIDIA followed through with the transaction of paying for access to the dataset offered by Anna's Archive. "Because Anna's Archive charged tens of thousands of dollars for 'high-speed access' to its pirated collections [] NVIDIA sought to find out what "high-speed access" to the data would look like," reads the complaint
Nvidia faces expanded allegations in a class-action lawsuit claiming the company sought high-speed access to 500 terabytes of pirated books from Anna's Archive for AI training. Internal emails reportedly show management approved the deal despite warnings about illegally obtained data, raising questions about how tech giants source training data for their language models.
Nvidia has been accused of attempting to secure high-speed access to Anna's Archive, a notorious shadow library containing millions of pirated books, to fuel its AI training efforts. According to an amended complaint filed in the U.S. District Court for the Northern District of California, internal emails reveal that the Nvidia data strategy team contacted Anna's Archive to explore using its massive repository for pre-training Large Language Models [1][2]. The lawsuit, which now includes authors Abdi Nazemian, Brian Keene, Stewart O'Nan, Andre Dubus III, and Susan Orlean, significantly expands the scope of copyright infringement claims against the GPU giant [3].
The amended complaint cites internal Nvidia communications that appear damning. One email snippet shows an unnamed Nvidia executive writing: "we are exploring including Anna's Archive in pre-training data for our LLMs" and seeking "to get a better understanding of LLM-related work you have done" [2]. More significantly, the complaint alleges that Anna's Archive warned Nvidia that its library content was illegally obtained and maintained, yet "within a week of contacting Anna's Archive, and days after being warned by Anna's Archive of the illegal nature of their collections, Nvidia management gave 'the green light' to proceed with the piracy" [1]. Anna's Archive reportedly charged tens of thousands of dollars for high-speed access to its collections and offered Nvidia approximately 500 terabytes of data, including millions of books typically available through Internet Archive's digital lending system [3][4].
The complaint asserts that "competitive pressures drove NVIDIA to piracy," highlighting the intense race among AI companies to secure vast training datasets [3][4]. Nvidia develops its own AI models, including NeMo, Retro-48B, InstructRetro, and Megatron, which require massive text libraries for training [3]. Beyond Anna's Archive, Nvidia also faces accusations of using other pirated sources, including LibGen, Sci-Hub, and Z-Library, in addition to the Books3 database [3]. The lawsuit further alleges that Nvidia distributed scripts and tools enabling corporate customers to download The Pile, which contains the Books3 pirated dataset, introducing new claims of vicarious infringement and contributory infringement [3].
Nvidia isn't alone in facing scrutiny over data sourcing practices. Meta and Anthropic have previously been found using pirated content from Books3 for their language models [1][2]. However, this marks the first public disclosure of correspondence between a major U.S. tech company and Anna's Archive [3]. In a previous case against Anthropic, a court judge ruled that while accessing the data did fall under fair use, "Anthropic had no entitlement to use pirated copies for its central library" [2]. Nvidia has defended its actions under fair use, arguing that "training measures statistical correlations in the aggregate, across a vast body of data" and that books represent mere statistical correlations to its AI models [2][3].
The evidence presented during the discovery phase of this class-action lawsuit raises critical questions about how AI companies balance innovation with intellectual property rights. These court documents appear to show that super-wealthy firms jealously guard their own technologies while showing little regard for the intellectual property of others [1]. The authors seek compensation for damages, and hundreds of other authors whose work appears in the massive pirate library may later join the class-action lawsuit [1][3]. The complaint does not explicitly state whether Nvidia followed through with paying for access to the dataset or if the deal was ultimately completed [3][4]. As this case progresses, it may set precedents for how courts interpret fair use in the context of AI training and whether companies can legally leverage illegally obtained data for commercial AI development.
Summarized by Navi