Curated by THEOUTPOST
On Fri, 10 Jan, 12:05 AM UTC
17 Sources
[1]
Zuckerberg's YouTube Defense Sparks Debate in Meta's AI Copyright Battle
Zuckerberg's Unconventional Defense: Fair Use or Flawed Logic in AI Training Debate? Meta CEO Mark Zuckerberg recently defended his company's use of a copyrighted e-book dataset in a deposition for the ongoing AI copyright case Kadrey v. Meta. The case is one of many lawsuits in which copyright holders are suing AI companies, and the courts must determine whether training AI models on copyrighted materials constitutes 'fair use'. Zuckerberg likened Meta's predicament to YouTube's struggle to remove unlawful content and said that, at times, a blanket ban on such a source is not reasonable.
[2]
In AI copyright case, Zuckerberg turns to YouTube for his defense | TechCrunch
Meta CEO Mark Zuckerberg appears to have used YouTube and its battle to take down pirated content to defend his own company's use of a data set containing copyrighted e-books to train AI models, newly released snippets of his deposition reveal. The deposition, which was part of a complaint submitted to the court by plaintiffs' attorneys, is related to AI copyright case Kadrey v. Meta. It's one of many such cases winding through the U.S. court system that are pitting AI companies against authors and other IP holders. For the most part, the defendants in these cases - AI companies - claim that training on copyrighted content is "fair use." Many copyright holders disagree. "For example, YouTube, I think, may end up hosting some stuff that people pirate for some period of time, but YouTube is trying to take that stuff down," Zuckerberg said during his deposition, according to portions of a transcript made available Wednesday night. "And the vast majority of the stuff on YouTube, I would assume, is kind of good and they have the license to do." Snippets from Zuckerberg's deposition provide some clues to his thinking on copyrighted content and fair use. However, it should be noted that a full transcript of the deposition was not released. TechCrunch has reached out to Meta for additional context and will update the article if the company responds. Based on the deposition nuggets, Zuckerberg appears to be defending Meta's use of a training data set of e-books called LibGen to develop its family of AI models known as Llama. Meta's Llama competes against flagship models from AI companies like OpenAI. LibGen, which describes itself as a "links aggregator," provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.
According to court filings unsealed this week, Zuckerberg allegedly cleared the use of LibGen to train at least one of Meta's Llama models despite concerns within the company's AI exec and research teams over the legal implications. Counsel for the plaintiffs, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, quoted Meta employees as referring to LibGen as a "data set we know to be pirated" and flagging that its use "may undermine [Meta's] negotiating position with regulators," according to a legal filing. During his deposition, Zuckerberg claimed he "hadn't really heard of" LibGen. "I get that you're trying to get me to give an opinion of LibGen, which I haven't really heard of," said Zuckerberg during the deposition. "It's just that I don't have knowledge of that specific thing." Under questioning from one of the plaintiffs' attorneys, David Boies, Zuckerberg explained why it would be unreasonable to prohibit using a data set like LibGen. "So would I want to have a policy against people using YouTube because some of the content may be copyrighted? No," he said. "[T]here are cases where having such a blanket ban might not be the right thing to do." Zuckerberg did state that Meta should be "pretty careful about" training on copyrighted material. "You know, [if there's] someone who's providing a website and they're intentionally trying to violate people's rights ... obviously it's something that we would want to be cautious about or careful about how we engaged with it or maybe even prevent our teams from engaging with it," Zuckerberg said during his deposition, according to the transcript. Plaintiffs' lawyers in the Kadrey v. Meta case have amended the complaint several times since it was filed in United States District Court for the Northern District of California, San Francisco Division in 2023.
The latest amended complaint filed by plaintiffs' counsel late Wednesday contains new allegations against Meta, including that the company cross-referenced certain pirated books in LibGen with copyrighted books available for license. Lawyers allege Meta used this tactic to determine whether it made sense to pursue a licensing agreement with a publisher. Meta allegedly used LibGen to train its latest family of Llama models, Llama 3, per the amended filing. Plaintiffs also allege that Meta is using the data set to train its next-gen Llama 4 models. According to the amended filing, Meta researchers allegedly tried to hide the fact that Llama models were trained on copyrighted materials by inserting "supervised samples" into Llama's fine-tuning. And Meta downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, the amended complaint alleges. Z-Library, or Z-Lib, has been the subject of a number of legal actions brought by publishers, including domain seizures and takedowns. In 2022, the Russian nationals who allegedly maintained it were charged with copyright infringement, wire fraud, and money laundering.
[3]
Mark Zuckerberg Allegedly Trained AI Models on Copyrighted Works
LibGen is a link aggregator that provides access to copyrighted works. Meta is facing a copyright lawsuit over allegedly using copyrighted works to train its artificial intelligence (AI) models. The lawsuit was filed by multiple complainants, including several bestselling authors. The primary allegation against the tech giant is that it used pirated e-books and articles to train the older versions of its Llama AI models, violating copyright laws. The filings also accuse company CEO Mark Zuckerberg of allowing the Llama AI team to torrent from a sketchy link aggregator to access the copyrighted materials. The information comes from two separate documents filed with the US District Court for the Northern District of California on Wednesday. The documents, from complainants such as authors Sarah Silverman and Ta-Nehisi Coates, highlight Meta's testimony given in late 2024, in which it was revealed that Zuckerberg permitted the use of a dataset called LibGen to train its Llama AI models. Notably, LibGen (short for Library Genesis) is a file-sharing platform that offers free access to academic and general-interest content. Many consider it a pirate library as it gives access to copyrighted works that are otherwise either available behind a paywall or are not digitised at all. The platform has faced several lawsuits and has been ordered to shut down in the past. The filings claim that Meta used the LibGen dataset with full knowledge that it contained pirated content and broke copyright laws. The document also cited a memo to Meta's AI decision-makers that mentions that after "escalation to MZ," Meta's AI team "has been approved to use LibGen". Here, MZ is shorthand for the Meta CEO's name. The memo also mentioned that the executives were alerted to the fact that public knowledge about using "a dataset we know to be pirated such as LibGen" could undermine its negotiating position with regulators.
The social media giant was also accused of stripping copyright information from the dataset's text and metadata to conceal its infringement. As per the filings, Nikolay Bashlykov, a research engineer working in Meta's AI division, allegedly removed copyright information from the LibGen dataset. To further hide evidence of its alleged use of the dataset, "Meta's programmers included 'supervised samples' of data when fine-tuning Llama to ensure Llama's output would include less incriminating answers when answering prompts regarding the source of Meta's AI training data," stated the document. Further, the complainants also alleged that Meta was involved in another kind of copyright infringement just by accessing LibGen. The filings claimed that the tech giant torrented the LibGen dataset. Torrenting involves both downloading and uploading (also known as seeding) the content. The uploading can be considered distribution of copyrighted materials and constitute a violation, claimed the filings. "Had Meta bought Plaintiffs' works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement. Meta's decision to bypass lawful methods of acquiring books and become a knowing participant in an illegal torrenting network establishes a CDAFA [California Comprehensive Computer Data Access and Fraud Act] violation and serves as proof of copyright infringement," the filings stated. Currently, the copyright lawsuit is open and a ruling is pending. Meta is yet to make its arguments, which are likely to be based on fair use. The court will have to decide whether the AI model's generative capabilities can be considered transformative enough to validate that argument or not.
[4]
Mark Zuckerberg gave Meta's Llama team the OK to train on copyrighted works, filing claims
Counsel for plaintiffs in a copyright lawsuit filed against Meta allege that Meta CEO Mark Zuckerberg gave the green light to the team behind the company's Llama AI models to use a data set of pirated ebooks and articles for training. The case, Kadrey v. Meta, is one of many against tech giants developing AI that accuse the companies of training models on copyrighted works without permission. For the most part, defendants like Meta have asserted that they're shielded by fair use, the U.S. legal doctrine that allows for the use of copyrighted works to make something new as long as it's sufficiently transformative. Many creators reject that argument. In newly unredacted documents filed with the U.S. District Court for the Northern District of California late Wednesday, plaintiffs in Kadrey v. Meta, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta's testimony from late last year, during which it was revealed that Zuckerberg approved Meta's use of a data set called LibGen for Llama-related training. LibGen, which describes itself as a "links aggregator," provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. According to Meta's testimony, as relayed by plaintiffs' counsel, Zuckerberg cleared the use of LibGen to train at least one of Meta's Llama models despite concerns within Meta's AI exec team and others at the company. The filing quotes Meta employees as referring to LibGen as a "data set we know to be pirated," and flagging that its use "may undermine [Meta's] negotiating position with regulators." The filing also cites a memo to Meta AI decision-makers noting that after "escalation to MZ," Meta's AI team "[was] approved to use LibGen." 
(MZ, here, is rather obvious shorthand for "Mark Zuckerberg.") The details seemingly line up with reporting from The New York Times last April, which suggested that Meta cut corners to gather data for its AI. At one point, Meta was hiring contractors in Africa to aggregate summaries of books and considering buying the publisher Simon & Schuster, according to the Times. But the company's execs determined that it would take too long to negotiate licenses and reasoned that fair use was a solid defense. The filing Wednesday contains new accusations, like that Meta might've tried to conceal its alleged infringement by stripping the LibGen data of attribution. According to plaintiffs' counsel, Meta engineer Nikolay Bashlykov, who works on the Llama research team, wrote a script to remove copyright info, including the word "copyright" and "acknowledgments," from ebooks in LibGen. Separately, Meta allegedly stripped copyright markers from science journal articles and "source metadata" in the training data it used for Llama. "This discovery suggests that Meta strips [copyright information] not just for training purposes," the filing reads, "but also to conceal its copyright infringement, because stripping copyrighted works ... prevents Llama from outputting copyright information that might alert Llama users and the public to Meta's infringement." According to the latest filing, Meta also revealed during depositions that it torrented LibGen, a move that gave some Meta research engineers pause. Torrenting, a way of distributing files across the web, requires that torrenters simultaneously "seed," or upload, the files they're trying to obtain. Plaintiffs' counsel alleges that Meta effectively engaged in another form of copyright infringement by torrenting LibGen and thus helping to spread its contents. Meta also tried to conceal its activities, counsel alleges, by minimizing the number of files it uploaded. 
According to the filing, Meta's head of generative AI, Ahmad Al-Dahle, "cleared the path" for torrenting LibGen -- brushing aside Bashlykov's reservations that doing so "could be legally not OK." "Had Meta bought plaintiffs' works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement," wrote plaintiffs' counsel in the filing. "Meta's decision to bypass lawful methods of acquiring books and become a knowing participant in an illegal torrenting network ... serves as proof of copyright infringement." The case against Meta is far from decided. As of now, it only pertains to Meta's earliest Llama models -- not its recent releases. And the court may well decide in Meta's favor if it's persuaded by the company's fair use argument. But the allegations don't reflect well on Meta, as the judge presiding over the case, Judge Thomas Hixson, noted in an order on Wednesday rejecting Meta's request to redact large portions of the filing. "It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage," Hixson wrote. "Rather, it is designed to avoid negative publicity." We've reached out to Meta for comment and will update this piece if we hear back.
[5]
Zuckerberg Appeared to Know Meta Trained AI on Pirated Library
The AI rush has brought with it thorny questions of copyright and ownership of data as tech companies train bots like ChatGPT on existing texts, but it seems Meta largely brushed these aside as it worked to integrate such tools into Facebook and Instagram. As first revealed in a motion filed by attorneys for novelists Christopher Golden and Richard Kadrey and comedian Sarah Silverman, who are pursuing a class-action suit against Meta for allegedly using their copyrighted work without permission, employees at the tech giant had candid conversations about the potential for scandal that would arise from leveraging a risky resource: Library Genesis, or LibGen, a massive so-called "shadow library" of free downloadable ebooks and PDFs that includes otherwise paywalled research and academic articles. In these exchanges, Meta's engineers identified LibGen as "a dataset we know to be pirated," but indicated that CEO Mark Zuckerberg had approved its use for training the next iteration of its large language model, Llama. Now, under a court order from Judge Vince Chhabria of the U.S. District Court for the Northern District of California, the records of those previously confidential internal dialogues have been unsealed, and appear to confirm Zuckerberg's decision to greenlight the transfer of pirated, copyrighted LibGen data to improve Llama -- despite concerns about a backlash. In an email to Joelle Pineau, vice president of AI research at Meta, Sony Theakanath, director of product management, wrote, "After a prior escalation to MZ [Mark Zuckerberg], GenAI has been approved to use LibGen for Llama 3 [...] with a number of agreed upon mitigations." The note observed that including the LibGen material would help them reach certain performance benchmarks, and alluded to industry rumors that other AI companies, including OpenAI and Mistral AI, are "using the library for their models."
In the same email, Theakanath wrote that under no circumstances would Meta publicly disclose its use of LibGen. The same email lays out the legal exposures and potential negative media attention that could follow if "external parties" deduce that the LibGen trove formed part of Llama's training data: "Copyright and IP is top of mind for legislators around the world, including in the US and EU," the document states. "US legislators expressed concern in a recent hearing about AI developers using pirated websites for training. It's unclear what their legislative actions would be if the concern spreads, but it reflects some of the negative lobbying right holders have been doing, related to our litigation on this topic (along the lines that this is 'stolen' content that then taints the output of this model)." Meta did not immediately return a request for comment on these internal communications. Elsewhere in the unsealed documents, Meta employees describe methods for processing and filtering text from LibGen in order to remove "boilerplate" indications of copyright, such as "ISBN," "Copyright," "©," and "All rights reserved." The author of a memo titled "Observations on LibGen-SciMag" ("SciMag" is the library's catalogue of science journals) reports that the material's "quality is high and the documents are long so this should be great data to learn from, in particular, for highly specialized knowledge!" The same memo recommends trying to "remove more copyright headers and document identifiers" -- seemingly more evidence that Meta was looking to cover its tracks as it exploited this cache of technical text that it did not have permission to use. Other revealing messages show Meta's AI research team and executives discussing best methods for obtaining the LibGen data set besides directly torrenting it, or downloading via peer-to-peer file sharing, from the company's IP addresses. At some points, employees wondered if this was even allowed. 
"I think torrenting from a corporate laptop doesn't feel right," wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the "SciMag" data had indeed been torrented.) And in October 2023 messages to a researcher working on Llama, Ahmad Al-Dahle, vice president of GenAI at Meta, said he had "cleared the path to use" LibGen and was "pushing from the top" to incorporate other data sets to improve Llama and win the AI race. It's no wonder Meta fought the unsealing and unredacting of these discussions as the discovery period in the copyright lawsuit came to an end: they seem to damage the company's argument that "using text to statistically model language and generate original expression" falls under the legal rubric of fair use, or the permissible limited use of copyrighted material without permission, as its lawyers put it in a motion to dismiss the suit. The plaintiffs' attorneys, moreover, recorded in their latest filing that Zuckerberg himself in a recent deposition said that the kind of piracy described in their latest amended complaint would raise "lots of red flags" and "seems like a bad thing." Of course, Meta, which Tuesday announced it will be cutting the 5 percent of its workforce deemed its "lowest performers," or some 3,600 workers, is hardly alone as a Silicon Valley behemoth accused of flouting (or circumventing) copyright law. This class action could prove a bellwether for the many other suits in progress against AI companies regarding the ownership of photographs, art, music, journalism, books, and more. But as long as tech firms are hungrily searching for more stuff for its bots to replicate and remix, they will always be reliant on the original content creators: human beings.
[6]
Zuckerberg Knowingly Used Pirated Data to Train Meta AI, Authors Allege - Decrypt
Mark Zuckerberg approved using pirated books to train Meta AI, even after his own team warned the material was illegally obtained, a group of authors allege in a recent court filing. The allegations come from a copyright infringement lawsuit filed by a group of authors including the comedian Sarah Silverman, Christopher Golden, and Richard Kadrey in a California federal court in July 2023. The group claimed Meta misused their books to train its Llama LLM, and they're asking for damages and an injunction to stop Meta from using their works. The judge in the case dismissed most of the authors' claims in November of that same year, but these recent allegations may breathe new life into the legal dispute. "Meta's CEO, Mark Zuckerberg, approved Meta's use of the LibGen dataset notwithstanding concerns within Meta's AI executive team (and others at Meta) that LibGen is 'a dataset we know to be pirated,'" lawyers for the plaintiffs said in a Wednesday filing. Despite these red flags, the lawsuit alleges that, "after escalation," Zuckerberg gave the green light for Meta's AI team to proceed with using the controversial dataset. Representatives for Meta did not immediately respond to Decrypt's request for comment. LibGen, short for Library Genesis, is an online platform that provides free access to books, academic papers, articles, and other written publications without properly abiding by copyright laws. It operates as a "shadow library," offering these materials without authorization from publishers or copyright holders. It currently hosts over 33 million books and over 85 million articles. The lawsuit alleges Meta tried to keep this under wraps until the last possible moment. Just two hours before the fact discovery deadline on December 13, 2024, the company dumped what plaintiffs describe as "some of the most incriminating internal documents it has produced to date." Meta's own engineers seemed uncomfortable with the plan, according to statements in court filings.
The group of authors allege internal messages show Meta engineers hesitated to download the pirated material, with one noting that "torrenting from a [Meta-owned] corporate laptop doesn't feel right (smile emoji)." Nevertheless, they proceeded to not only download the books but also systematically strip out copyright information to prepare them for AI training, the lawsuit claims. The latest filings in the lawsuit paint a picture of a company fully aware of the risks: One internal memo warned that "media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, may undermine our negotiating position with regulators." Yet Meta went ahead anyway, both downloading and distributing (or "seeding") the pirated content through torrenting networks by January 2024, according to the lawsuit. When questioned about these activities in a deposition, Zuckerberg appeared to distance himself from the decision, testifying that such piracy would raise "lots of red flags" and "seems like a bad thing." The court documents also suggest that Meta's approach to handling copyrighted information paid more attention to model training than copyright rules. According to the filing, one engineer "filtered [...] copyright lines and other data out of LibGen to prepare a CMI-stripped version of it to train Llama." This systematic removal of copyright information could strengthen the authors' claims that Meta knowingly tried to hide its use of pirated materials. The revelations come at a crucial time for Meta's AI ambitions. The company has been pushing hard to compete with OpenAI and Google in the AI space, with Llama 3.2 being the most popular open source LLM, and Meta AI being a solid free competitor to ChatGPT with similar features. Most of these AI companies are facing legal battles due to their questionable practices when it comes to training their large language models. 
Meta was already sued by another group of authors for copyright infringements, OpenAI is currently facing different lawsuits for training its LLMs on copyrighted material, and Anthropic is also facing different accusations from authors and songwriters. But in general the tech entrepreneurs and creators have been up in arms ever since generative AI exploded in popularity. There are currently dozens of different lawsuits against AI companies for willingly using copyrighted material to train their models. But as with most things on the bleeding edge, we'll have to wait and see what the courts have to say about it all.
[7]
Zuckerberg approved Meta's use of 'pirated' books to train AI models, authors claim
Sarah Silverman and others file court case claiming CEO approved use of dataset despite warnings Mark Zuckerberg approved Meta's use of "pirated" versions of copyright-protected books to train the company's artificial intelligence models, a group of authors has alleged in a US court filing. Citing internal Meta communications, the filing claims that the social network company's chief executive backed the use of the LibGen dataset, a vast online archive of books, despite warnings within the company's AI executive team that it is a dataset "we know to be pirated". The internal message says that using a database containing pirated material could weaken the Facebook and Instagram owner's negotiations with regulators, according to the filing. "Media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, may undermine our negotiating position with regulators." The US author Ta-Nehisi Coates, the comedian Sarah Silverman and the other authors suing Meta for copyright infringement made the accusations in a filing made public on Wednesday, in a California federal court. The authors sued Meta in 2023, arguing that the social media company misused their books to train Llama, the large language model that powers its chatbots. The Library Genesis, or LibGen, dataset is a "shadow library" that originated in Russia and claims to contain millions of novels, nonfiction books and science magazine articles. Last year a New York federal court ordered LibGen's anonymous operators to pay a group of publishers $30m (£24m) in damages for copyright infringement. Use of copyrighted content in training AI models has become a legal battleground in the development of generative AI tools such as the ChatGPT chatbot, with creative professionals and publishers warning that using their work without permission is endangering their livelihoods and business models. 
The filing cites a memo, referring to Mark Zuckerberg's initials, noting that "after escalation to MZ", Meta's AI team "has been approved to use LibGen". Quoting internal communications, the filing also says Meta engineers discussed accessing and reviewing LibGen data but hesitated on starting that process because "torrenting", a term for peer-to-peer sharing of files, from "a [Meta-owned] corporate laptop doesn't feel right". A US district judge, Vince Chhabria, last year dismissed claims that text generated by Meta's AI models infringed the authors' copyrights and that Meta unlawfully stripped their books' copyright management information (CMI), which refers to information about the work including the title, name of the author and copyright owner. However, the plaintiffs were given permission to amend their claims. The writers argued this week that the evidence bolstered their infringement claims and justified reviving their CMI case and adding a new computer fraud allegation. Chhabria said during a hearing on Thursday that he would allow the writers to file an amended complaint but expressed scepticism about the merits of the fraud and CMI claims.
[8]
Lawsuit says Mark Zuckerberg approved Meta's use of pirated materials to train Llama AI
Meta allegedly used LibGen for AI training and even stripped copyright information from its materials. Meta knowingly used pirated materials to train its Llama AI models -- with the blessing of company chief Mark Zuckerberg -- according to an ongoing copyright lawsuit against the company. As TechCrunch reports, the plaintiffs of the Kadrey v. Meta case submitted court documents describing the company's use of the LibGen dataset for AI training. LibGen is generally described as a "shadow library" that provides file-sharing access to academic and general-interest books, journals, images and other materials. The counsel for the plaintiffs, which include writers Sarah Silverman and Ta-Nehisi Coates, accused Zuckerberg of approving the use of LibGen for training despite concerns raised by company executives and employees who described it as a "dataset [they] know to be pirated." The company removed copyright information from LibGen materials, the complaint also said, before feeding them to Llama. Meta apparently admitted in a document submitted to court that it "remov[ed] all the copyright paragraphs from beginning and the end" of scientific journal articles. One of its engineers even reportedly made a script to automatically delete copyright information. The counsel argued that Meta did so to conceal its copyright infringement activities from the public. In addition, the counsel mentioned that Meta admitted to torrenting LibGen materials, even though its engineers felt uneasy about sharing them "from a [Meta-owned] corporate laptop." Silverman, alongside other writers, sued Meta and OpenAI for copyright infringement in 2023. They accused the companies of using pirated materials from shadow libraries to train their AI models. The court previously dismissed some of their claims, but the plaintiffs said their amended complaint supports their allegations and addresses the court's earlier reasons for dismissal.
[9]
Mark Zuckerberg named in lawsuit over Meta's use of pirated books for AI training
A group of authors, including Ta-Nehisi Coates and Sarah Silverman, alleged in a court filing that Meta CEO Mark Zuckerberg approved "Meta's torrenting and processing of pirated copyrighted works" to train the company's AI models. The California filing, which was made public on Wednesday, claims that Zuckerberg supported the use of the LibGen dataset, an archive that originated in Russia and contains a library of pirated books, to train its Llama AI. In a document submitted to the court, Meta admitted that it "remov[ed] all the copyright paragraphs from the beginning and the end" of scientific journal articles, Engadget reported. The suit alleges that this was explicitly done to hide the fact that Meta was using copyrighted materials. Clearly, Meta did not want this information to be made public. The Guardian reported that the filing stated: "Media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, may undermine our negotiating position with regulators." All the while, Meta has been integrating Llama AI further into its apps and services. This comes just a few days after Zuckerberg announced that he is replacing fact-checkers with Community Notes, lifting prohibitions against some discriminatory and hateful rhetoric on its platforms, and pushing more political content on Instagram and Threads.
[10]
Lawsuit Alleges Mark Zuckerberg Gave Permission for Meta to Train AI on Stolen Content
An amended lawsuit against Meta alleges that the company's CEO, Mark Zuckerberg, approved training Meta's Llama AI models using copyrighted content. The initial lawsuit was filed in 2023, alleging that Meta had trained its AI using copyrighted content. However, courts have favored Meta's position so far, with a U.S. district judge, Vince Chhabria, siding with Meta last year but allowing plaintiffs to amend their claims. In amended legal documents filed with the United States District Court for the Northern District of California this week, the plaintiffs present new evidence that "corroborate and bolster" their original copyright infringement claim. The amended lawsuit says that Zuckerberg was aware of the decision to torrent from LibGen and that he had "approved" the use of LibGen. Allegations that Meta, like other technology companies developing AI products and services, ignored copyright laws are nothing new. However, what is different with the latest update to the lawsuit is that the plaintiffs allege that Meta, from the top down, acted to conceal its malfeasance. It is one thing to illegally use copyrighted content, and another issue altogether to try to conceal the behavior. Plaintiffs claim that Meta downloaded content from LibGen, a "shadow library" that provides file-sharing access to copyrighted content, knowing that the troubled platform was distributing content illegally. Further, Meta acted as a "leech," siphoning as much content as possible from LibGen without sharing resources back, a behavior the lawsuit says corroborates claims that Meta was trying to conceal its conduct. "Meta's decision to 'leech' the torrent reflects its knowledge that it was acting illegally and its desire to conceal its conduct, because the more illicit data it uploaded to the rest of the hacker community, the more likely its conduct would be discovered by others," the lawsuit explains. 
"Indeed, that uploading occurred notwithstanding serious legality doubts from the very person who did the downloading," the filing continues. Meta's head of generative AI, Ahmad Al-Dahle, allegedly approved the use of LibGen "and other illicit sources." Further, Meta allegedly stripped copyright management information (CMI) from the data, a violation of the Digital Millennium Copyright Act (DMCA) and an attempt to "facilitate and conceal widespread copyright infringement." As for LibGen, otherwise known as Library Genesis, it illegally hosts millions of copyrighted files including non-fiction and fiction books, magazine articles, scientific journals, and more. A U.S. judge ordered LibGen to pay $30 million in penalties last year. However, the penalties are unlikely to be paid since it is unclear who operates LibGen. What is clear is that LibGen is well known as an illegal supplier of stolen materials, a fact that Meta would be entirely aware of.
[11]
Meta AI copyright case: Who's liable for using open-source data?
A group of authors in the US -- Richard Kadrey, Sarah Silverman and Christopher Golden -- claim that Meta torrented and processed their books to train its artificial intelligence models. The authors originally filed a case against Meta for copyright infringement of their works in 2023. In their original lawsuit, the authors mentioned that the company used their books as a part of the training dataset "without consent, without credit and without compensation". As per a recent discovery order in the case, Meta CEO Mark Zuckerberg gave the company's AI team orders to use the LibGen dataset. The fact discovery in the case also reveals that Meta engineers filtered out lines of copyright management information (CMI) to train its AI model Llama. This is despite the fact that in response to piracy discussions during the case, Zuckerberg said that such activity "raises lots of red flags." Library Genesis or LibGen is a file-sharing project that provides access to copyrighted works like academic books and journal articles. Besides the current lawsuit, publishers have in the past also filed court cases against LibGen for facilitating online piracy. For instance, Elsevier, the American Chemical Society, and Wiley Publications filed a court case in India in 2020 against LibGen and fellow free academic paper-accessing site SciHub for copyright infringement. The publishers urged the country to permanently block the sites in India. Internet Service Providers (ISPs) across multiple countries including the United Kingdom, France, and Germany block access to LibGen links. In their 2023 lawsuit, the authors mentioned that in a table describing the contents of its AI models' training dataset, Meta had listed that 85 gigabytes (GB) came from books. 
These books come from two sources: The authors say that the data from the Pile's Books3 data set includes a mix of fiction and non-fiction books from the "shadow library" Bibliotik similar to other free book sites like Zlib, LibGen, and SciHub. "These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host," the lawsuit suggests. While it seems here that Zuckerberg knew about the presence of LibGen sources in the company's database, this lawsuit brings up some key questions about liability in the AI ecosystem. Meta acquired this data through an open-source dataset, which was accumulated from Bibliotik Private Tracker, part of the many shadow libraries like LibGen. If an AI company obtained the data through open-source datasets, which in turn obtained it from shadow libraries, who bears liability at each step of this chain? While some have filed cases against piracy websites in the past, should open-source dataset creators also bear liability for infringement? Meta isn't the only company to face a lawsuit for allegedly using copyrighted information to train its models. In the recent past, others like Microsoft and OpenAI have also been hit with a series of lawsuits from authors and news publications suggesting that the two used their information to train AI models. In what appears to be a response to these cases, OpenAI signed an agreement with News Corp in May last year which gave it access to the content of major news publications like The Wall Street Journal, New York Post, and The Daily Telegraph. Microsoft similarly signed an agreement with HarperCollins to train AI on its books in November last year. Just like OpenAI and Microsoft, Meta also signed a content licensing deal with Reuters in October last year.
[12]
Meta knew it used pirated books to train AI, authors say
(Reuters) - Meta Platforms used pirated versions of copyrighted books to train its artificial intelligence systems with approval from its CEO Mark Zuckerberg, a group of authors alleged in newly disclosed court papers. Ta-Nehisi Coates, comedian Sarah Silverman and other authors suing Meta for copyright infringement made the accusations in filings made public on Wednesday in California federal court. They said internal documents produced by Meta during the discovery process showed the company knew the works were pirated. Spokespeople for Meta did not immediately respond to a request for comment. The authors sued Meta in 2023, arguing that the tech giant misused their books to train its large language model Llama. The case is one of several alleging that copyrighted works by authors, artists and others were used to develop AI products without permission. Defendants have argued that they made fair use of copyrighted material. The authors asked the court on Wednesday for permission to file an updated complaint. They said new evidence showed Meta used the AI training dataset LibGen, which allegedly includes millions of pirated works, and distributed it through peer-to-peer torrents. They said internal Meta communications showed Zuckerberg "approved Meta's use of the LibGen dataset notwithstanding concerns within Meta's AI executive team (and others at Meta) that LibGen is 'a dataset we know to be pirated.'" U.S. District Judge Vince Chhabria last year dismissed claims that text generated by Meta's chatbots infringed the authors' copyrights and that Meta unlawfully stripped their books' copyright management information (CMI). The writers argued Wednesday that the evidence bolstered their infringement claims and justified reviving their CMI claim and adding a new computer fraud claim. Chhabria said during a hearing on Thursday that he would allow the writers to file an amended complaint but expressed skepticism about the merits of the fraud and CMI claims. 
(Reporting by Blake Brittain in Washington; Editing by David Bario and Aurora Ellis)
[13]
Court docs allege Meta trained AI model using LibGen
Did Zuck's definition of 'free expression' just get even broader? Meta allegedly downloaded material from an online source that's been sued for breaching copyright, because it wanted the material to train its AI models, according to a new court filing. The accusation was made in a document [PDF] filed in the case of Richard Kadrey et al vs Meta Platforms, in which novelist Kadrey (and others including comedian Sarah Silverman) allege stolen versions of their work were used to train AI models. Several similar suits are in motion, targeting different AI players. The document claims that Meta decided to download documents from Library Genesis - aka "LibGen" to train its models. LibGen is the subject of a lawsuit brought by textbook publishers who believe it happily hosts and distributes stolen works, and even accepts donations to fund its operations. The filing from plaintiffs in the Kadrey case claims that documents produced by Meta during the discovery process - the pre-trial activity of gathering relevant documents - describe internal debate about accessing LibGen, a little squeamishness about using BitTorrent in the office to do so, and eventual escalation to "MZ" who approved use of the contentious resource. The filing states that evidence about use of LibGen is new and was made available by Meta late in the discovery process. Another filing [PDF] claims that a Meta document describes how it removed copyright notifications from material downloaded from LibGen, and suggests the company did so because it realized including such text could mean a model's output would reveal it was trained on copyrighted material. A third document [PDF], this one filed by Meta, argues that the plaintiffs have unjustifiably claimed that use of LibGen is new material and contends that it was on the record for months. 
The nub of the matter appears to be an attempt by the plaintiffs to use the info about Meta's use of LibGen to add an action under the California Comprehensive Computer Data Access and Fraud Act. That law makes it a crime to access a computer or network without permission with the intent to defraud or commit other crimes. Meta doesn't think the extra action is justified. Meta's filing includes a statement that the company "rejects the notion that it has 'distributed' LibGen", seemingly to address plaintiffs' arguments that merely using BitTorrent meant it spread stolen content to others. But if there's a denial that LibGen was accessed, we can't find it. Meta tried to have the filings we've linked to above sealed on grounds of commercial sensitivity. The judge in the case rejected that, arguing that Meta just wants to avoid publicity. US District Court Judge Vince Chhabria also noted that in one of the documents Meta wants to seal, an employee wrote the following: Sorry if we undermined you, Zuck. The allegation of using LibGen is very on-brand for Meta, given its business model is built on free content contributed by users. Why should pesky authors be treated any different? ®
[14]
Meta accused of training its AI using pirated content from torrents
A new day, a new controversy around artificial intelligence. This time, Meta has been accused of using pirated content from torrents to train its large language model (LLM) Llama, which powers Meta AI. The case was one of the first copyright lawsuits filed against a tech company for training AI. As reported by Wired, Meta was hit with a lawsuit in 2023 for allegedly training Llama, the company's LLM, with pirated content. The case became known as "Kadrey et al. v. Meta Platforms" and was filed by novelists Richard Kadrey and Christopher Golden, who claimed that Meta used copyrighted content without authorization. Until now, Meta had handed over documents with redacted information to the court, but Judge Vince Chhabria of the United States District Court for the Northern District of California ordered that the original documents should be made public - and that's what happened. The documents reveal conversations between Meta employees about Meta AI and Llama. In one of the conversations, an engineer says that "torrenting from a [Meta-owned] corporate laptop doesn't feel right," which corroborates that the company used pirated content to train its AI. Another conversation suggests that "MZ" (Mark Zuckerberg) authorized the use of pirated material. Evidence suggests that Meta used content from LibGen, a huge library of pirated books, magazines and academic articles. LibGen was created in Russia in 2008 and has been hit by multiple copyright lawsuits since then, even though no one knows who actually operates the "piracy hub." Meta also reportedly used content from other "shadow libraries" for AI training. The company argues that it used public materials under the legal doctrine of "fair use," which allows the use of copyrighted content without permission in certain circumstances, which are analyzed on a case-by-case basis. Meta also claims that it's just "using text to statistically model language and generate original expression." 
This is not the first time that big tech companies have been accused of training AI models with copyrighted content. Last year, an investigation revealed that the OpenELM model created by Apple included subtitles from more than 170,000 YouTube videos. Although at first this led people to believe that Apple was using copyrighted content to train Apple Intelligence, the company later explained that OpenELM was an open-source model created for research purposes and that its database is not used to power Apple Intelligence. According to Apple, its AI features available on iOS and macOS are trained "on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler." It's worth noting that many large publishers such as The New York Times and The Atlantic have chosen not to share their content with Apple Intelligence training.
[15]
Meta Secretly Trained Its AI on a Notorious Russian 'Shadow Library,' Newly Unredacted Court Docs Reveal
One of the most important AI copyright legal battles just took a major turn. Meta just lost a major fight in its ongoing legal battle with a group of authors suing the company for copyright infringement over how it trained its artificial intelligence models. Against the company's wishes, a court unredacted information alleging that Meta used Library Genesis (LibGen), a notorious so-called shadow library of pirated books that originated in Russia, to help train its generative AI language models. The case, Kadrey et al v. Meta Platforms, was one of the earliest copyright lawsuits filed against a tech company over its AI training practices. Its outcome, along with those of dozens of similar cases working their way through courts in the United States alone, will determine whether technology companies can legally use creative works to train AI moving forward, and could either entrench AI's most powerful players or derail them. Vince Chhabria, a judge for the United States District Court for the Northern District of California, ordered both Meta and the plaintiffs on Wednesday to file full versions of a batch of documents after calling Meta's approach to redacting them "preposterous," adding that, for the most part, "there is not a single thing in those briefs that should be sealed." Chhabria ruled that Meta was not pushing to redact the materials in order to protect its business interests, but instead to "avoid negative publicity." The documents were originally filed late last year, but remained publicly unavailable until now. In his order, Chhabria referenced an internal quote from a Meta employee included in the documents, in which they speculated that "If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues." Meta declined to comment. 
Novelists Richard Kadrey and Christopher Golden, along with comedian Sarah Silverman, first filed the class-action lawsuit against Meta in July 2023, alleging the tech giant trained its language models using their copyrighted work without permission. Meta has argued that using publicly available materials to train AI tools is shielded by the "fair use" doctrine, which holds that using copyrighted works without permission is legal in certain cases, one of which, the company argues, is "using text to statistically model language and generate original expression," the company's lawyers wrote in a motion to dismiss the authors' lawsuit in November 2023. In this particular lawsuit, Meta has also argued that the plaintiffs' claims are without merit. Before these documents were made public, Meta previously disclosed in a research paper that it had trained its Llama large language model on portions of Books3, a dataset of around 196,000 books scraped from the internet. It had not previously publicly indicated, however, that it had torrented data directly from LibGen. These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because "torrenting from a [Meta-owned] corporate laptop doesn't feel right 😃". They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as "MZ" in the memo handed over during discovery) and that Meta's AI team was "approved to use" the pirated material.
[16]
Facebook Apparently Trained Its AI by Torrenting Pirated Books Stolen From Authors
And Zuckerberg personally approved the piracy, according to these documents. Newly unredacted court documents allege that Meta, formerly Facebook, knowingly used pirated books obtained from the online archive Library Genesis to train its AI models, Wired reports. Submitted in an ongoing lawsuit filed against the platform by a group of authors including Ta-Nehisi Coates and comedian Sarah Silverman, the documents were finally released in full after a judge shot down Meta's attempts to keep portions of them sealed. The judge argued, per Wired, that Meta fought for the redactions merely to "avoid negative publicity," citing a damning internal quote from one of its employees. "If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues," the employee wrote. Library Genesis, or LibGen, is a "shadow library" that provides free access to millions of books, academic articles, and magazines. That a multibillion-dollar corporation like Meta would tap into its store of pirated content is the latest sign of the impunity that tech companies have operated with to train their large language models, vacuuming up copyrighted content en masse with seemingly no regard for the law -- or even the decency, as one of the world's most valuable companies, to buy a single copy of each volume it was using to power its AI. Meta and other AI leaders argue that using books and other data scraped from the web constitutes "fair use," but it will ultimately be up to legal battles like this one to determine if that is the case. Fair use or not, some of the exchanges exposed in the newly unredacted documents suggest that Meta employees knew that what they were doing was legally and ethically dicey. With a grinning emoji, one engineer wrote: "Torrenting from a [Meta-owned] corporate laptop doesn't feel right," as quoted by Wired. And it goes all the way to the top. 
A cited memo allegedly shows that after employee discussions about using LibGen were escalated to "MZ" -- Meta CEO Mark Zuckerberg -- the AI team was "approved to use" material from the database. The plaintiffs argue that this shatters any plausibility that Meta may try to maintain. "Meta has treated the so-called 'public availability' of shadow datasets as a get-out-of-jail-free card, notwithstanding that internal Meta records show every relevant decision-maker at Meta, up to and including its CEO, Mark Zuckerberg, knew LibGen was 'a dataset we know to be pirated,'" they wrote in the most recent motion. Moreover, the authors point to testimony from a Meta corporate representative as an admission that the company also helped disseminate the pirated books by "seeding" their corresponding torrents, or uploading portions of the material so that other users could download them.
[17]
Meta trained Llama on copyrighted material, new filing claims
Meta requested that a 'large portion' of the new filing be redacted; the judge, however, denied the request. In a new filing, the counsel for the trio of authors suing Meta claims that the company allowed its artificial intelligence (AI) large language model Llama to commit copyright infringement on pirated data and upload it for commercial gain. The lawsuit was initially filed in July 2023 in the Northern District Court of California by authors Richard Kadrey, Christopher Golden and Sarah Silverman, and is just one of the many ongoing AI copyright-related legal battles against Big Tech. The fresh document filed on 8 January alleges "newly discovered evidence". The new allegations claim that Meta was aware that its training dataset for Llama contained copyrighted material that it used without permission, and that the company removed the copyright management information (CMI) from the works before processing the data for its AI model. The discovery, the filing adds, "suggests" that the company strips CMI "not just for training purposes, but also to conceal its copyright infringement". It also claims that Meta went out of its way to include "supervised samples of data" to ensure that Llama's output "would include less incriminating answers" when responding to prompts regarding the source of its training data. Moreover, the filing also claims that Meta downloaded content from LibGen, a torrent website which provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education, for Llama's training, acting as a "leech" on pirated data. Meanwhile, the filing also noted that a Meta corporate representative, who testified under oath in November last year, "admitted" to uploading pirated files that contained the plaintiffs' works on torrent websites. 
Vince Chhabria, the judge presiding over the case, who rejected Meta's request to redact "large portions" of the new filing, said: "It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage...rather, it is designed to avoid negative publicity." However, in November 2023 Chhabria granted Meta's motion to dismiss all claims in the lawsuit except for the one which alleged that Meta, without authorisation, copied the plaintiffs' books for training Llama. The judge, while granting the motion for dismissal, said that the other allegations, which included claims that Llama "cannot function without" information extracted from the plaintiffs' books, were "nonsensical".
Meta CEO Mark Zuckerberg defends the use of copyrighted e-books to train AI models, comparing it to YouTube's content moderation challenges. The case raises questions about fair use in AI development.
In a high-profile lawsuit, Meta faces allegations of using copyrighted materials to train its AI models without proper authorization. The case, Kadrey v. Meta, involves bestselling authors Sarah Silverman and Ta-Nehisi Coates as plaintiffs, challenging the tech giant's practices in AI development 1.
During a deposition, Meta CEO Mark Zuckerberg drew a controversial parallel between Meta's use of copyrighted e-books and YouTube's content moderation challenges. He argued that, like YouTube, which may temporarily host pirated content, it's not always rational to completely avoid using certain datasets in AI training 2.
Zuckerberg stated, "So would I want to have a policy against people using YouTube because some of the content may be copyrighted? No. There are cases where having such a blanket ban might not be the right thing to do" 2.
At the heart of the lawsuit is Meta's alleged use of LibGen, a controversial "links aggregator" providing access to copyrighted works. Court filings suggest that Zuckerberg approved the use of LibGen for training Meta's Llama AI models, despite internal concerns about legal implications 3.
Plaintiffs' counsel alleges that Meta attempted to conceal its use of copyrighted materials. According to the filings, Meta engineer Nikolay Bashlykov wrote a script to remove copyright information from ebooks in LibGen. The company is also accused of stripping copyright markers from science journal articles and source metadata in the training data 4.
The lawsuit also claims that Meta torrented the LibGen dataset, potentially engaging in another form of copyright infringement by participating in the distribution of copyrighted materials. This decision allegedly raised concerns among some Meta research engineers 4.
Meta's primary defense rests on the fair use doctrine, arguing that using text to statistically model language and generate original expression falls under permissible use of copyrighted material. However, the recently unsealed documents appear to challenge this argument 5.
This case is part of a larger debate surrounding AI companies' use of copyrighted works for training. The outcome could set a precedent for how fair use is interpreted in the context of AI development, potentially affecting the entire tech industry's approach to AI training data 1.
As the AI industry continues to grapple with these legal and ethical challenges, the resolution of this case may have far-reaching implications for the future of AI development and copyright law in the digital age.
Reference
[1]
[3]
[4]
[5]
© 2025 TheOutpost.AI All rights reserved