5 Sources
[1]
World's largest shadow library made a 300TB copy of Spotify's most streamed songs
The world's largest shadow library -- which is increasingly funded by AI developers -- shocked the Internet this weekend by announcing it had "backed up Spotify" and started distributing 300 terabytes of metadata and music files in bulk torrents. According to Anna's Archive, the data grab represents more than 99 percent of listens on Spotify, making it "the largest publicly available music metadata database with 256 million tracks." It's also "the world's first 'preservation archive' for music which is fully open," with 86 million music files, the archive boasted. The music files supposedly represent about 37 percent of songs available on Spotify as of July 2025. The scraped files were prioritized by popularity, with Anna's Archive weeding out many songs that are never streamed or are of poor quality, such as AI-generated songs. Spotify did not immediately respond to Ars' request to comment. But the music streaming giant told Android Authority on Monday that it was investigating whether Anna's Archive had actually scraped its platform "at scale," as its blog claimed. "An investigation into unauthorized access identified that a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform's audio files," Spotify said. "We are actively investigating the incident." It's unclear how much Spotify data was actually scraped, Android Authority noted, or if the company will possibly pursue legal action to take down the torrents. For Anna's Archive, the temptation to scrape the data may have been too much after stumbling upon "a way to scrape Spotify at scale," supposedly "a while ago." "We saw a role for us here to build a music archive primarily aimed at preservation," the archive said. Scraping Spotify data was a "great start," they said, toward building an "authoritative list of torrents aiming to represent all music ever produced." 
A list like that "does not exist for music," the archive said, and would be akin to LibGen -- which was used by tech giants like Meta and startups like Anthropic to notoriously pirate book datasets to train AI. Releasing the metadata torrents this December was the first step toward achieving this "preservation" mission, Anna's Archive said. Next, the Archive will release torrents of music files, starting with the most popular streams first, then eventually releasing torrents of less popular songs and album art. In the future, "if there is enough interest, we could add downloading of individual files to Anna's Archive," the blog said.
"This is insane": Users fear data grab will doom archive
Anna's Archive claimed that the Spotify data was scraped to help preserve "humanity's musical heritage," protecting it "forever" from "destruction by natural disasters, wars, budget cuts, and other catastrophes." However, some Anna's Archive fans -- who largely use the search engine to find books, academic papers, and magazine articles -- were freaked out by the news that Spotify data was scraped. On Hacker News, some users questioned whether the data would be useful to anyone but AI researchers, since searching bulk torrents for individual songs seemed impractical for music fans. One user pointed out that "there are already tools to automatically locate and stream pirated TV and movie content automatic and on demand" -- suggesting that music fans could find a way to stream the data. But others worried Anna's Archive may have been baited into scraping Spotify, perhaps taking on legal risks that AI companies prone to obscuring their training data sources likely wish to avoid. "This is insane," a top commenter wrote. "Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff. 
Or if the major record labels already license their entire catalogs for training purposes cheaply enough, so this really is just solely intended as a preservation effort?" But Anna's Archive is clearly working to support AI developers, another noted, pointing out that Anna's Archive promotes selling "high-speed access" to "enterprise-level" LLM data, including "unreleased collections." Anyone can donate "tens of thousands" to get such access, the archive suggests on its webpage, and any interested AI researchers can reach out to discuss "how we can work together." "AI may not be their original/primary motivation, but they are evidently on board with facilitating AI labs piracy-maxxing," a third commenter suggested. Meanwhile, on Reddit, some fretted that Anna's Archive may have doomed itself by scraping the data. To them, it seemed like the archive was "only making themselves a target" after watching the Internet Archive struggle to survive a legal attack from record labels that ended in a confidential settlement last year. "I'm furious with AA for sticking this target on their own backs," a redditor wrote on a post declaring that "this Spotify hacking will just ruin the actual important literary archive." As Anna's Archive fans spiraled, a conspiracy was even raised that the archive was only "doing it for the AI bros, who are the ones paying the bills behind the scenes" to keep the archive afloat. Ars could not immediately reach Anna's Archive to comment on users' fears or Spotify's investigation. On Reddit, one user took comfort in the fact that the archive is "designed to be resistant to being taken out," perhaps preventing legal action from ever really dooming the archive. "The domain and such can be gone, sure, but the core software and its data can be resurfaced again and again," the user explained. But not everyone was convinced that Anna's Archive could survive brazenly torrenting so much Spotify data. 
"This is like saying the Titanic is unsinkable" that user warned, suggesting that Anna's Archive might lose donations if Spotify-fueled takedowns continually frustrate downloads over time. "Sure, in theory data can certainly resurface again and again, but doing so each time, it will take money and resources, which are finite. How many times are folks willing to do this before they just give up?"
[2]
Activist group says it has scraped 86m music files from Spotify
Platform with 700m users worldwide says it is investigating after Anna's Archive claims to have accessed tracks and metadata An activist group has claimed to have scraped millions of tracks from Spotify and is preparing to release them online. Observers said the apparent leak could boost AI companies looking for material to develop their technology. A group called Anna's Archive said it had scraped 86m music files from Spotify and 256m rows of metadata such as artist and album names. Spotify, which hosts more than 100m tracks, confirmed that the leak does not represent its entire inventory. The Stockholm-based company, which has more than 700m users worldwide, said it had "identified and disabled the nefarious user accounts that engaged in unlawful scraping". "An investigation into unauthorized access identified that a third party scraped public metadata and used illicit tactics to circumvent DRM [digital rights management] to access some of the platform's audio files. We are actively investigating the incident," said Spotify. Spotify does not believe the music taken by Anna's Archive has been released yet. Anna's Archive, which is known for providing links to pirated books, said in a blog it wanted to create a "'preservation archive' for music". The group claimed the audio files represent 99.6% of all music listened to by Spotify users and would be shared via "torrents" - a means of sharing large digital files online. "Of course Spotify doesn't have all the music in the world, but it's a great start," said Anna's Archive, which describes its mission as "preserving humanity's knowledge and culture". "With your help, humanity's musical heritage will be forever protected from destruction by natural disasters, wars, budget cuts, and other catastrophes," said the group. Ed Newton-Rex, a composer and campaigner for protecting artists' copyright, said the leaked music would probably be used for developing AI models. 
"Training on pirated material is sadly common in the AI industry, so this stolen music is almost certain to end up training AI models. This is why governments must insist AI companies reveal the training data they use," he said. The Anna's Archive site makes references to LibGen, a vast online archive of pirated books that has allegedly been used by Mark Zuckerberg's Meta to train its AI models. According to a US court filing, Zuckerberg, Meta's founder and chief executive, approved use of the LibGen dataset despite warnings within the company's AI executive team that it is a dataset "we know to be pirated". The co-founder of an AI startup wrote on LinkedIn that members of the public could in theory "create their own personal free version of Spotify". Yoav Zimmerman, co-founder of Third Chair, said it could also allow tech companies to "train on modern music at scale." He added: "The only thing stopping them is copyright law and the deterrent of enforcement." Copyright has become a battleground between artists, authors and creatives on one side and AI companies on the other. AI tools like chatbot and music generators are trained on vast amounts of data taken from the open web, including copyright-protected work. In the UK, creative professionals have protested against a government proposal to let AI companies use copyright-protected work without permission unless the owner of the copyright-protected work signals they do not want their data to be taken. Almost every respondent to a government consultation on the proposal has backed artists' concerns. Liz Kendall, the secretary of state for science, innovation and technology, told parliament this month there was "no clear consensus" on the issue, adding that ministers would "take the time to get this right". The government has pledged to make policy proposals on AI and copyright by 18 March next year.
[3]
A pirate activist group scraped and released Spotify's entire library
A pirate activist group said it 'backed up' Spotify's music catalogue, claiming it put metadata for 256 million tracks online. The streaming platform said it's 'actively monitoring' the incident. Streaming platform Spotify confirmed on Monday its library had been scraped by a third party, after a pirate activist group claimed it released metadata for the platform's entire music catalogue. According to a blog post on the open source search engine Anna's Archive, the release includes metadata for 256 million tracks and 86 million audio files, representing around 99.6 percent of listens. The files cover music that was put on the platform between 2007 and 2025, the blog post said. "It's the world's first 'preservation archive' for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space)," the blog post stated. A spokesperson from Spotify confirmed the unauthorised access of its library, adding that the third party "used illicit tactics to circumvent DRM (digital rights management) to access some of the platform's audio files". "Spotify has identified and disabled the nefarious user accounts that engaged in unlawful scraping. We've implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behaviour," the spokesperson later added in a statement to Euronews Next. The spokesperson said there is no indication of any non-public user information being compromised in the breach, and that the only user-related data involved relates to public playlists created by users. Spotify did not specify how much data was scraped. Hackers said the data was "a little under 300TB in total size" and would be distributed on peer-to-peer file-sharing networks in bulk torrents. Anna's Archive claims its mission is "preserving humanity's knowledge and culture". The search engine for "shadow libraries" has until now been focused on books and other texts. 
"This Spotify scrape is our humble attempt to start such a 'preservation archive' for music," the blog post states. "Of course Spotify doesn't have all the music in the world, but it's a great start." Theoretically, anyone with the technical knowledge and disk space could use the archive to create their own copy of Spotify. Realistically, anyone who tries will face swift and severe legal action from record companies and other rightsholders. One of the bigger concerns is the potential for artificial intelligence (AI) companies to use the data to train their models, according to Yoav Zimmerman, CEO of Third Chair, a company that tracks unauthorised use of intellectual property. "It also just became dramatically easier for AI companies to train on modern music at scale," Zimmerman said in a LinkedIn post. "The only thing stopping them is copyright law and the deterrent of enforcement." Spotify said it is actively working with industry partners to protect the rights of the creative community. "Since day one, we have stood with the artist community against piracy," the company shared in a statement.
[4]
Who is Anna's Archives and was Spotify user data stolen in mega heist? Data leak explained and see if stolen files were released online
Who is Anna's Archives and was Spotify user data stolen in mega heist has become a major question after claims of a large scale Spotify scraping operation. A piracy activist group known as Anna's Archives says it collected 86 million music files from Spotify, along with track metadata, totaling nearly 300TB of data. The claim has drawn attention from the music industry, artists, and listeners worldwide. Spotify, which hosts more than 100 million tracks for over 700 million users, says no user data was accessed and listening accounts remain secure. The company confirmed it disabled accounts linked to unlawful scraping and added safeguards. The incident has renewed debate around digital piracy, copyright enforcement, and the use of scraped music data. Who is Anna's Archives and was Spotify user data stolen in mega heist is a question raised after claims of a large Spotify scraping operation. A piracy activist group called Anna's Archives says it collected 86 million music files from Spotify. The group also claims the total data size reaches 300TB. Spotify is one of the largest music streaming platforms. It hosts over 100 million tracks. It serves more than 700 million users worldwide. The group claims the files represent about 99.6 percent of Spotify listens. Spotify has denied that user data was stolen. The company says the issue does not impact listeners. Spotify states that the activity involved unlawful scraping of music files and metadata. Anna's Archives is known as a piracy activist group. It has been linked to digital archiving projects in the past. In a blog post, the group said it discovered a way to scrape Spotify at scale. The group stated it wanted to create a music preservation archive. It said the archive would be open and focused on long term access. According to Anna's Archives, the scraping effort was planned and deliberate. The group claims it collected 86 million Spotify music files. It also claims to have taken metadata. 
Metadata includes information such as track names, artists, albums, and identifiers. The group says the full archive totals around 300TB of data. The claim that Spotify 86 million files were looted has raised concern in the music industry. The 300TB figure includes audio files and metadata. Metadata is often used by platforms to manage catalogs and recommend music. Spotify says the music catalog has not been released online. The company also says it does not believe the scraped files are publicly available at this time. Spotify confirms it took action after detecting suspicious behavior. The company disabled accounts linked to the scraping activity. Spotify says it has added new safeguards to prevent similar attempts in the future. Spotify has clearly stated that user data was not affected. The company says no listener information was accessed. This includes usernames, passwords, payment data, and listening history tied to individuals. The issue centers on scraped music files and metadata. Spotify says the scraping violated its terms and copyright rules. It says the activity does not qualify as a breach of user security. Spotify also says it actively monitors for scraping and piracy. It works with industry partners to protect artists and rights holders. Even though Spotify user data was not stolen, concerns remain. Music industry groups worry that scraped data could be used to train artificial intelligence systems. Artists may not have given consent for such use. Metadata and audio files can be valuable for AI training. This raises questions about copyright and control. Spotify says it stands with artists against piracy and unauthorized use of music. The company says it is committed to defending creator rights. It says it will continue monitoring for unlawful scraping. Spotify released a statement after the claims became public. The company said it identified and disabled accounts involved in unlawful scraping. It said new protections are now in place. 
Spotify said it has opposed piracy since its launch. It said it is working with partners across the music industry. The goal is to protect creators and prevent misuse of content. Spotify maintains that the platform remains secure for users. It says streaming access continues as normal. At present, there is no confirmation that the scraped Spotify files have been released. Anna's Archives has said it intends to release the archive. Spotify says it is watching the situation closely. The case highlights ongoing challenges around digital piracy. It also shows the growing concern over large scale data scraping. Music platforms continue to face pressure to protect content. Who is Anna's Archives and was Spotify user data stolen in mega heist remains a key search topic as the situation develops. Q1: Who is Anna's Archives and was Spotify user data stolen in mega heist? Anna's Archives is a piracy activist group. Spotify says user data was not stolen. The issue involves scraping music files and metadata, not personal listener information. Q2: What is the Spotify 86 million files looted and 300TB data leak claim? The group claims it scraped 86 million Spotify music files and metadata totaling 300TB. Spotify says the files are not publicly released and actions were taken.
[5]
Spotify Music Data Leak Exposes Pirated-Content AI Training Risk
MediaNama Take: In India, the Spotify scrape lands amid an unresolved legal tension between data protection and copyright enforcement. Under the Digital Personal Data Protection Act (DPDPA), publicly available personal data is exempt from several consent requirements and can be processed, including for AI training, with limited safeguards. While this carve-out may cover music metadata, it does not extend to copyrighted audio files, exposing a gap that large-scale scraping exploits. The issue is already before the Indian courts. In ANI's copyright lawsuit against OpenAI in the Delhi High Court, the news agency has challenged the use of copyrighted content for AI training without authorisation, signalling that Indian courts may take a stricter view of "lawful access" than data protection law alone suggests. Against this backdrop, a recently proposed DPIIT committee framework recommends mandatory licensing and statutory royalties for AI training, with no opt-out for creators. If adopted, it would sharply limit the legal usability of datasets derived from unauthorised scraping, such as the Spotify archive, regardless of their scale or technical availability. A piracy-linked activist group, Anna's Archive, has leaked Spotify's music data online after scraping and backing up large parts of the streaming platform's library, including metadata for around 256 million tracks and audio files linked to nearly 99.6% of all listens on Spotify. Spotify confirmed that unauthorised access took place and said it has taken action against the accounts involved. In a statement issued on December 22, the company said: "Spotify has identified and disabled the nefarious user accounts that engaged in unlawful scraping. We've implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behavior. 
Since day one, we have stood with the artist community against piracy, and we are actively working with our industry partners to protect creators and defend their rights." In a blog post published on December 20, Anna's Archive said it had backed up large parts of Spotify's catalogue and distributed the data through bulk torrents. According to the group's own claims, the release includes metadata for around 256 million tracks and audio files for roughly 86 million songs. The total size of the data is close to 300 terabytes. The group is releasing the files in stages, starting with metadata and then moving on to music files ranked by popularity. Spotify said that the unauthorised actors accessed some audio files but primarily scraped publicly available metadata and used 'illicit tactics' to bypass protections on certain files. The company said it is continuing to investigate the incident. The leaked material includes track-level information such as song titles, artist names, album details, popularity scores, International Standard Recording Codes (ISRC), market availability, and audio features generated by Spotify. The dataset also contains information about playlists, including playlist names, follower counts, and track listings. In addition, the group claims to have stored audio files in compressed formats, along with embedded metadata and technical identifiers. Anna's Archive calls the release a 'preservation archive' and states that it intends it for long-term storage rather than everyday use. In its blog post, the group argues that existing music archiving efforts fragment the collection or focus mainly on popular content. However, the operation involved scraping Spotify at scale and distributing copyrighted material without authorisation, which constitutes piracy under copyright law. The scale of the alleged Spotify scrape raises concerns that go beyond conventional music piracy and copyright enforcement. 
If large portions of Spotify's catalogue, metadata and audio files alike are circulating through torrents, it lowers the practical barrier for AI companies or independent developers to train music-related AI models using pirated content. The sheer volume of data involved also underscores the industrial scale of the leak. At nearly 300 terabytes, the dataset represents one of the largest known unauthorised disclosures in the content and media industry. To put this in perspective, storing 300 TB of data would require more than 600 laptops with 500 GB of storage each or around 2,400 smartphones with 128 GB of storage. The scale goes well beyond individual piracy or casual scraping, pointing instead to infrastructure-level data extraction with implications for copyright enforcement, platform security, and downstream reuse at scale. This concern mirrors ongoing copyright litigation in the AI sector. Earlier this year, court filings in the US revealed that Meta had torrented 81.7 terabytes of pirated books from shadow libraries such as LibGen and Z-Library, accessed via Anna's Archive, to train its AI models. Authors have argued that this amounted to large-scale copyright infringement carried out for commercial AI development. The Spotify dataset, if used similarly, could expose music rights holders to comparable risks, particularly if AI training occurs outside regulated or licensed frameworks. With access to a dataset of this magnitude, it also becomes technically feasible for individuals or small groups to recreate private, Spotify-like music libraries using personal servers, sufficient storage, and media streaming software. In theory, such setups could host vast catalogues of music and metadata without platform-level safeguards, licensing agreements, or content moderation controls. This, in turn, raises broader questions about the downstream use of scraped cultural data. 
Unlike regulated streaming platforms, AI models or private systems built on pirated datasets may not incorporate copyright filters, attribution mechanisms, or content restrictions. As seen in the emergence of so-called "uncensored" or lightly moderated AI models trained on unlicensed data, the absence of safeguards can amplify legal, ethical, and economic risks for creators. While copyright law and the threat of enforcement remain the primary deterrents, the Spotify incident highlights how large-scale scraping of mainstream platforms can blur the line between piracy, archiving, and AI training, posing challenges not just for rights enforcement but for the governance of AI systems built on cultural and creative works. Spotify has not disclosed how long the scraping went undetected or whether any user data beyond music and metadata was affected. There is no indication so far that listener accounts or personal information were compromised. For now, Spotify says it has shut down the accounts linked to the scraping and strengthened its safeguards. Whether the leaked data can be fully contained remains unclear, given that it is already circulating on peer-to-peer networks.
A piracy activist group called Anna's Archive claims it scraped 86 million music files and 256 million rows of metadata from Spotify, totaling nearly 300 terabytes. The shadow library says it's building a music preservation archive, but experts warn the dataset could fuel AI training on pirated content. Spotify confirmed unauthorized access and disabled accounts involved.
Anna's Archive, a piracy activist group known for providing access to pirated books, shocked the internet this weekend by announcing it had "backed up Spotify" and begun distributing nearly 300 terabytes of data through bulk torrents [1]. The Spotify music data leak includes metadata for approximately 256 million tracks and audio files for roughly 86 million songs, representing about 99.6 percent of all listens on the platform [2]. Spotify, which hosts more than 100 million tracks for over 700 million users worldwide, confirmed the unauthorized access on Monday and said it is "actively investigating the incident" [3].
The Stockholm-based streaming giant told reporters that "a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform's audio files" [2]. The company has identified and disabled the accounts that engaged in unlawful scraping and implemented new safeguards for these types of anti-copyright attacks [3]. Importantly, Spotify emphasized that no user data was stolen and that the only user-related information involved relates to public playlists created by users [4].
Anna's Archive claims its mission centers on building a music preservation archive to protect "humanity's musical heritage" from "destruction by natural disasters, wars, budget cuts, and other catastrophes" [1]. The group said it discovered "a way to scrape Spotify at scale" some time ago and saw an opportunity to create "the world's first 'preservation archive' for music which is fully open" [3]. The scraped music files and metadata cover tracks uploaded to the platform between 2007 and 2025, with files prioritized by popularity [1].
However, observers and industry experts express deep skepticism about the group's stated preservation motives. Ed Newton-Rex, a composer and campaigner for protecting artists' copyright, told The Guardian that "training on pirated material is sadly common in the AI industry, so this stolen music is almost certain to end up training AI models" [2]. The concern intensifies when considering that Anna's Archive promotes selling "high-speed access" to "enterprise-level" LLM data, including "unreleased collections," with interested AI researchers encouraged to reach out about collaboration [1].
The Anna's Archive site makes explicit references to LibGen, a vast online archive of pirated content that has allegedly been used by Meta to train its AI models [2]. According to US court filings, Meta's founder Mark Zuckerberg approved use of the LibGen dataset despite warnings within the company's AI executive team that it is a dataset "we know to be pirated" [2]. Earlier revelations showed Meta had torrented 81.7 terabytes of pirated books from shadow libraries such as LibGen and Z-Library, accessed via Anna's Archive, to train its AI models [5].
Yoav Zimmerman, co-founder of Third Chair, a company that tracks unauthorized use of intellectual property, noted on LinkedIn that members of the public could theoretically "create their own personal free version of Spotify" using the archive [2]. More significantly, he observed that "it also just became dramatically easier for AI companies to train on modern music at scale," with copyright law and the deterrent of enforcement being the only obstacles [3].
The dataset's metadata includes track-level information such as song titles, artist names, album details, popularity scores, International Standard Recording Codes (ISRC), market availability, and audio features generated by Spotify [5]. It also contains information about playlists, including playlist names, follower counts, and track listings, along with audio files in compressed formats with embedded metadata and technical identifiers [5].
At nearly 300 terabytes, the dataset represents one of the largest known unauthorized disclosures in the content and media industry. To put this in perspective, storing 300TB of data would require more than 600 laptops with 500GB of storage each or around 2,400 smartphones with 128GB of storage [5]. This scale goes well beyond individual digital piracy or casual data scraping, pointing instead to infrastructure-level data extraction with implications for copyright enforcement, platform security, and downstream reuse at scale.
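The device comparison can be checked with a quick calculation. The sketch below is illustrative only: it assumes 1 TB = 1,024 GB, a convention the sources do not state, and the `devices_needed` helper is a hypothetical name introduced here.

```python
import math

TOTAL_TB = 300    # approximate archive size reported by the sources
GB_PER_TB = 1024  # assumption: binary convention; the sources do not specify


def devices_needed(total_tb: float, device_gb: float, gb_per_tb: int = GB_PER_TB) -> int:
    """Return the smallest whole number of devices that can hold total_tb."""
    return math.ceil(total_tb * gb_per_tb / device_gb)


laptops = devices_needed(TOTAL_TB, 500)  # 500 GB laptops
phones = devices_needed(TOTAL_TB, 128)   # 128 GB smartphones
print(f"laptops: {laptops}, smartphones: {phones}")
```

Under this assumption the result is 615 laptops and 2,400 smartphones, in line with "more than 600" and "around 2,400". With 1 TB = 1,000 GB the laptop count is exactly 600 and the smartphone count drops to about 2,344, so the quoted figures are best read as rough estimates.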
Reaction from Anna's Archive's own user base has been mixed, with many expressing alarm that the group may have overreached. On Hacker News, users questioned whether the data would be useful to anyone but AI researchers, since searching bulk torrents for individual songs seemed impractical for music fans [1]. One top commenter wrote: "This is insane. Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff" [1].
On Reddit, some users fretted that Anna's Archive may have doomed itself by scraping the data, with one writing: "I'm furious with AA for sticking this target on their own backs," referencing how the Internet Archive struggled to survive a legal attack from record labels that ended in a confidential settlement last year [1]. The concern centers on whether the group, which many users rely on for accessing books and academic papers, has made itself vulnerable to legal action from rightsholders in the music industry.
The incident arrives as copyright law and AI training data remain contested territory globally. In the UK, creative professionals have protested against a government proposal to let AI companies use copyright-protected work without permission unless the owner explicitly opts out [2]. Almost every respondent to a government consultation on the proposal has backed artists' concerns, and the government has pledged to make policy proposals on AI and copyright by March 18 next year [2].
In India, the situation lands amid unresolved legal tension between data protection and copyright enforcement. Under the Digital Personal Data Protection Act (DPDPA), publicly available personal data is exempt from several consent requirements and can be processed, including for AI training, with limited safeguards [5]. However, this carve-out does not extend to copyrighted audio files, exposing a gap that large-scale scraping exploits. A recently proposed DPIIT committee framework recommends mandatory licensing and statutory royalties for AI training, with no opt-out for creators [5].
Spotify maintains it has "stood with the artist community against piracy" since day one and is actively working with industry partners to protect creators and defend their rights [3]. The company says it is monitoring for suspicious behavior and has implemented new safeguards, though it remains unclear whether legal action will follow to take down the torrents [1]. For now, the creative community watches closely as this latest chapter in the battle over pirated content and AI models unfolds.
Summarized by Navi