Curated by THEOUTPOST
On Fri, 21 Feb, 4:04 PM UTC
2 Sources
[1]
Meta defends its vast book torrenting: We're just a leech, no proof of seeding
Just because Meta admitted to torrenting a dataset of pirated books for AI training purposes, that doesn't necessarily mean that Meta seeded the file after downloading it, the social media company claimed in a court filing this week. Evidence instead shows that Meta "took precautions not to 'seed' any downloaded files," Meta's filing said.
Seeding refers to sharing a torrented file after the download completes, and because there's allegedly no proof of such "seeding," Meta insisted that authors cannot prove Meta shared the pirated books with anyone during the torrenting process.
Whether or not Meta actually seeded the pirated books could make a difference in a copyright lawsuit from book authors including Richard Kadrey, Sarah Silverman, and Ta-Nehisi Coates. Authors had previously alleged that Meta unlawfully copied and distributed their works through AI outputs -- an increasingly common complaint that so far has barely been litigated. But Meta's admission to torrenting appears to add a more straightforward claim of unlawful distribution of copyrighted works through illegal torrenting, which has long been considered established case-law.
Authors have alleged that "Meta deliberately engaged in one of the largest data piracy campaigns in history to acquire text data for its LLM training datasets, torrenting and sharing dozens of terabytes of pirated data that altogether contain many millions of copyrighted works."
Separate from their copyright infringement claims opposing Meta's AI training on pirated copies of their books, authors alleged that Meta torrenting the dataset was "independently illegal" under California's Computer Data Access and Fraud Act (CDAFA), which allegedly "prevents the unauthorized taking of data, including copyrighted works."
Meta, however, is hoping to convince the court that torrenting is not in and of itself illegal, but is, rather, a "widely-used protocol to download large files."
According to Meta, the decision to download the pirated books dataset from pirate libraries like LibGen and Z-Library was simply a move to access "data from a 'well-known online repository' that was publicly available via torrents."
To defend their torrenting, Meta has basically scrubbed the word "pirate" from the characterization of its activity. The company alleges that authors can't claim that Meta gained unauthorized access to their data under CDAFA. Instead, all they can claim is that "Meta allegedly accessed and downloaded datasets that Plaintiffs did not create, containing the text of published books that anyone can read in a public library, from public websites Plaintiffs do not operate or own."
While Meta may claim there's no evidence of seeding, there is some testimony that might be compelling to the court. Previously, a Meta executive in charge of project management, Michael Clark, had testified that Meta allegedly modified torrenting settings "so that the smallest amount of seeding possible could occur," which seems to support authors' claims that some seeding occurred. And an internal message from Meta researcher Frank Zhang appeared to show that Meta allegedly tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers. Once this information came to light, authors asked the court for a chance to depose Meta executives again, alleging that new facts "contradict prior deposition testimony."
Torrenting terminology may confuse court
How successful Meta's torrenting defense will be is still up in the air, but authors pointed out that even if Meta somehow managed to avoid seeding any of the torrented books, the social media giant still participated in an "online piracy ring." Further, in a footnote, authors told the court that "IP pirates like Meta also upload or share files with others during (leeching) and after (seeding) downloading."
Additionally, TorrentFreak noted that Meta "taking precautions is not the same as preventing" seeding. Authors will likely push to persuade the court that merely by torrenting the file, Meta made "pirated works available to other users worldwide," while making it clear that even Meta can't claim to have prevented all seeding.
A lawyer representing the authors declined to comment on whether ongoing discovery may surface more evidence to help prove the seeding claims. Lack of evidence could be a problem since TorrentFreak suggested the torrenting terminology may be foreign to the court, potentially muddying what authors feel otherwise is a straightforward claim that Meta allegedly knew it was violating laws by torrenting the pirated books.
Meta has been silent so far on claims about sharing data while "leeching" (downloading) but told the court it plans to fight the seeding claims at summary judgment.
At this time, Meta has moved to dismiss authors' CDAFA claim as being preempted by copyright law, but unsurprisingly authors told the court that they strongly disagree. "Had Meta bought Plaintiffs' works in a bookstore or borrowed them from a library and then trained its LLMs on them without a license, it would have committed copyright infringement, but no CDAFA violation," authors alleged. "Meta's decision to bypass lawful acquisition methods and become a knowing participant in an illegal peer-to-peer piracy network provides the 'extra element' and is 'qualitatively different' to establish an independent CDAFA violation." Authors further linked the CDAFA claim to their copyright infringement claim opposing Meta's AI training.
They alleged that by torrenting their works "from pirated databases in lieu of executing lawful licensing arrangements, Meta not only deprived Plaintiffs of that licensing revenue, but it also deprived Plaintiffs of additional revenue they could have generated from 'other users worldwide' because Meta simultaneously made the copyrighted works available to download by any interested Internet user in the process of acquiring Plaintiffs' data" to train AI. Meta did not immediately respond to Ars' request for comment.
[2]
Meta defends using pirated material, claims it's legal if you don't seed content
Configuration settings were modified "so that the smallest amount of seeding possible could occur".
Meta claimed in a court filing this week that, despite torrenting an 82 TB dataset of pirated, copyrighted material from shadow libraries to train its LLaMA AI models, its employees "took precautions not to 'seed' any downloaded files". In torrenting terminology, seeding refers to sharing a file with other users during or, more commonly, after downloading it. Since torrenting is a peer-to-peer system, every user downloading a file can also upload parts of it to other users.
Meta's lawyers claim that there are "no facts to show that Meta seeded Plaintiffs' books". The company's defense is therefore pinning its hopes on the fact that there isn't currently any proof that Meta shared the material during the torrenting process.
Though Meta claims that there is no evidence of seeding, Michael Clark, an executive at Meta in charge of project management, testified that the configuration settings they were using were modified "so that the smallest amount of seeding possible could occur". When Clark was then asked why Meta chose to minimize seeding, attorney-client privilege was invoked, so he did not answer. Clark's statement shows that Meta sought ways to minimize seeding, but the company has yet to offer any indication that it entirely prevented the seeding of copyrighted material.
Additionally, an internal message from Frank Zhang, a Meta researcher, could point toward alleged concealment of potential seeding from Meta's servers, in order to avoid the "risk of tracing back the seeder/downloader" to Facebook servers. Meta's defense seems to hinge on the lack of evidence that it shared the large amount of data it allegedly downloaded to train its AI models.
Should Meta win on this defense and establish that downloading copyrighted content isn't illegal, but distribution is, it could shake up future cases of piracy and unauthorized distribution of copyrighted content. A defense that relies on torrenting terminology could also be a way for Meta to trip up the courts. Focusing on seeding could further muddy the claim that Meta allegedly knew that it was violating laws by torrenting copyrighted material. Meta has yet to respond to claims about whether it knew that it was sharing data during the download process.
The authors whose copyrighted material Meta allegedly obtained without licensing agreements have pointed in their filing to "Meta's decision to bypass lawful acquisition methods and become a knowing participant in an illegal peer-to-peer piracy network".
With the court battle expected to continue, no final decision in the case has been made. Even after a final decision, Meta is expected to appeal if it loses, meaning that a final judgment could be a long while away.
Similar cases do exist, however. OpenAI was sued by novelists in 2023, and the New York Times has also sued OpenAI and Microsoft over "millions" of copied news articles. As the long list of LLM-related litigation grows, this is likely not the last we will hear of Meta's case.
Meta claims it didn't seed pirated books used for AI training, sparking debate on copyright infringement and data acquisition methods in AI development.
Meta, the social media giant, is embroiled in a legal battle over its use of pirated books to train its AI models. In a recent court filing, Meta defended its actions by claiming that while it did torrent a dataset of pirated books, it took precautions not to "seed" any downloaded files [1].
Meta admitted to torrenting an 82 TB dataset of pirated, copyrighted material from shadow libraries to train its LLaMA AI models. However, the company insists that there is no evidence of "seeding" - the act of sharing a torrented file after the download completes [2].
The lawsuit, filed by authors including Richard Kadrey, Sarah Silverman, and Ta-Nehisi Coates, alleges that Meta unlawfully copied and distributed their works through AI outputs. Meta's defense hinges on the lack of evidence of seeding, arguing that downloading copyrighted content isn't illegal, but distribution is [1].
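Despite Meta's claims, there is testimony that might challenge its defense: Michael Clark, a Meta executive in charge of project management, testified that torrenting settings were modified "so that the smallest amount of seeding possible could occur," and an internal message from Meta researcher Frank Zhang referred to avoiding the "risk of tracing back the seeder/downloader" to Facebook servers [1][2].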
Meta is attempting to dismiss the authors' claim under California's Computer Data Access and Fraud Act (CDAFA), arguing it's preempted by copyright law. The authors contend that Meta's "decision to bypass lawful acquisition methods" constitutes a separate CDAFA violation [1].
This case highlights the ongoing tension between AI development and copyright law. Similar lawsuits have been filed against other AI companies, including OpenAI and Microsoft, over the use of copyrighted material for training large language models [2].
The outcome of this case could have far-reaching implications for the AI industry, potentially setting precedents for how companies can legally acquire and use data for AI training. It also raises questions about the ethics of using pirated material for technological advancement [1][2].
As the court battle continues, no final decision has been made. Meta is expected to fight the seeding claims at summary judgment, and any decision is likely to face appeals, suggesting a long legal process ahead [1][2].
© 2025 TheOutpost.AI All rights reserved