Anna's Archive scrapes 300TB of Spotify data, sparking fears of AI training misuse

Reviewed by Nidhi Govil

5 Sources

A piracy activist group called Anna's Archive claims it scraped 86 million music files and 256 million rows of metadata from Spotify, totaling nearly 300 terabytes. The shadow library says it's building a music preservation archive, but experts warn the dataset could fuel AI training on pirated content. Spotify confirmed unauthorized access and disabled accounts involved.

Shadow Library Targets Spotify in Massive Data Scrape

Anna's Archive, a piracy activist group known for providing access to pirated books, shocked the internet this weekend by announcing it had "backed up Spotify" and begun distributing nearly 300 terabytes of data through bulk torrents [1]. The Spotify music data leak includes metadata for approximately 256 million tracks and audio files for roughly 86 million songs, representing about 99.6 percent of all listens on the platform [2]. Spotify, which hosts more than 100 million tracks for over 700 million users worldwide, confirmed the unauthorized access on Monday and said it is "actively investigating the incident" [3].

Source: ET

The Stockholm-based streaming giant told reporters that "a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform's audio files" [2]. The company has identified and disabled the accounts that engaged in unlawful scraping and implemented new safeguards for these types of anti-copyright attacks [3]. Importantly, Spotify emphasized that no user data was stolen and that the only user-related information involved relates to public playlists created by users [4].

Source: Euronews

Music Preservation Archive or AI Training Pipeline?

Anna's Archive claims its mission centers on building a music preservation archive to protect "humanity's musical heritage" from "destruction by natural disasters, wars, budget cuts, and other catastrophes" [1]. The group said it discovered "a way to scrape Spotify at scale" some time ago and saw an opportunity to create "the world's first 'preservation archive' for music which is fully open" [3]. The scraped music files and metadata cover tracks uploaded to the platform between 2007 and 2025, with files prioritized by popularity [1].

Source: MediaNama

However, observers and industry experts express deep skepticism about the group's stated preservation motives. Ed Newton-Rex, a composer and campaigner for protecting artist consent, told The Guardian that "training on pirated material is sadly common in the AI industry, so this stolen music is almost certain to end up training AI models" [2]. The concern intensifies when considering that Anna's Archive promotes selling "high-speed access" to "enterprise-level" LLM data, including "unreleased collections," with interested AI researchers encouraged to reach out about collaboration [1].

LibGen Parallels Raise Copyright Enforcement Questions

The Anna's Archive site makes explicit references to LibGen, a vast online archive of pirated content that has allegedly been used by Meta to train its AI models [2]. According to US court filings, Meta's founder Mark Zuckerberg approved use of the LibGen dataset despite warnings within the company's AI executive team that it is "a dataset we know to be pirated" [2]. Earlier revelations showed Meta had torrented 81.7 terabytes of pirated books from shadow libraries such as LibGen and Z-Library, accessed via Anna's Archive, to train its AI models [5].

Yoav Zimmerman, co-founder of Third Chair, a company that tracks unauthorized use of intellectual property, noted on LinkedIn that members of the public could theoretically "create their own personal free version of Spotify" using the archive [2]. More significantly, he observed that "it also just became dramatically easier for AI companies to train on modern music at scale," with copyright law and the deterrent of enforcement being the only obstacles [3].

Scale of Digital Piracy Reaches Industrial Proportions

The dataset's metadata includes track-level information such as song titles, artist names, album details, popularity scores, International Standard Recording Codes (ISRC), market availability, and audio features generated by Spotify [5]. It also contains information about playlists, including playlist names, follower counts, and track listings, along with audio files in compressed formats with embedded metadata and technical identifiers [5].

At nearly 300 terabytes, the dataset represents one of the largest known unauthorized disclosures in the content and media industry. To put this in perspective, storing 300TB of data would require more than 600 laptops with 500GB of storage each or around 2,400 smartphones with 128GB of storage [5]. This scale goes well beyond individual digital piracy or casual data scraping, pointing instead to infrastructure-level data extraction with implications for copyright enforcement, platform security, and downstream reuse at scale.
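The device comparison can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes decimal units (1 TB = 1,000 GB) and that each device's full capacity is usable, and the function name `devices_needed` is made up for this example. Under those assumptions, 300TB comes to 600 laptops of 500GB or 2,344 phones of 128GB, in line with the article's rounded figures.

```python
import math

DATASET_TB = 300   # "nearly 300 terabytes", rounded up for the estimate
GB_PER_TB = 1_000  # assumption: decimal (SI) units, not binary TiB/GiB


def devices_needed(dataset_tb: float, device_gb: float) -> int:
    """Return how many whole devices are needed to hold the dataset."""
    return math.ceil(dataset_tb * GB_PER_TB / device_gb)


laptops = devices_needed(DATASET_TB, 500)      # 500GB laptops
smartphones = devices_needed(DATASET_TB, 128)  # 128GB smartphones

print(f"500GB laptops needed: {laptops}")
print(f"128GB smartphones needed: {smartphones}")
```

Real devices reserve part of their storage for the operating system, so the practical counts would be somewhat higher than this estimate.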

Community Backlash and Legal Risks

Reaction from Anna's Archive's own user base has been mixed, with many expressing alarm that the group may have overreached. On Hacker News, users questioned whether the data would be useful to anyone but AI researchers, since searching bulk torrents for individual songs seemed impractical for music fans [1]. One top commenter wrote: "This is insane. Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff" [1].

On Reddit, some users fretted that Anna's Archive may have doomed itself by scraping the data, with one writing: "I'm furious with AA for sticking this target on their own backs," referencing how the Internet Archive struggled to survive a legal attack from record labels that ended in a confidential settlement last year [1]. The concern centers on whether the group, which many users rely on for accessing books and academic papers, has made itself vulnerable to legal action from rightsholders in the music industry.

What This Means for AI Training and Data Protection

The incident arrives as copyright law and AI training data remain contested territory globally. In the UK, creative professionals have protested against a government proposal to let AI companies use copyright-protected work without permission unless the owner explicitly opts out [2]. Almost every respondent to a government consultation on the proposal has backed artists' concerns, and the government has pledged to make policy proposals on AI and copyright by March 18 next year [2].

In India, the situation lands amid unresolved legal tension between data protection and copyright enforcement. Under the Digital Personal Data Protection Act (DPDPA), publicly available personal data is exempt from several consent requirements and can be processed, including for AI training, with limited safeguards [5]. However, this carve-out does not extend to copyrighted audio files, exposing a gap that large-scale scraping exploits. A recently proposed DPIIT committee framework recommends mandatory licensing and statutory royalties for AI training, with no opt-out for creators [5].

Spotify maintains it has "stood with the artist community against piracy" since day one and is actively working with industry partners to protect creators and defend their rights [3]. The company says it is monitoring for suspicious behavior and has implemented new safeguards, though it remains unclear whether legal action will follow to take down the torrents [1]. For now, the creative community watches closely as this latest chapter in the battle over pirated content and AI models unfolds.

TheOutpost.ai

© 2026 Triveous Technologies Private Limited