Anna's Archive scrapes 86 million music files from Spotify, raising AI training data concerns

Reviewed by Nidhi Govil



A pirate activist group called Anna's Archive claims to have scraped 86 million music files and 256 million rows of metadata from Spotify, representing 99.6% of all music listened to on the platform. The Stockholm-based streaming giant confirmed the unauthorized access and disabled the accounts involved, while experts warn the leaked material could become AI training data for music generators.


Spotify Confirms Unauthorized Access to Music Catalog

Spotify has confirmed that a pirate activist group gained unauthorized access to its platform and scraped 86 million music files and 256 million rows of metadata from the streaming service. Anna's Archive, a group previously known for providing links to shadow libraries of pirated books, claimed responsibility in a blog post describing the effort as creating a "preservation archive" for music [1]. The Stockholm-based company, which hosts more than 100 million tracks and serves over 700 million users worldwide, said it had "identified and disabled the nefarious user accounts that engaged in unlawful scraping" [1].

The scraped files do not cover Spotify's entire catalog, but they account for approximately 99.6% of all music listened to by Spotify users [2]. According to Anna's Archive, the files span music uploaded to the platform between 2007 and 2025, totaling "a little under 300TB in total size" [2]. The group plans to distribute the scraped music files through torrents, a peer-to-peer file-sharing method that allows anyone with sufficient disk space to mirror the entire dataset.

Digital Rights Management Circumvented in Breach

Spotify's investigation revealed that the third party "used illicit tactics to circumvent DRM [Digital Rights Management] to access some of the platform's audio files" [2]. The company emphasized that no non-public user information was compromised; the only user-related data involved was public playlists created by users. Spotify stated it has "implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behaviour" [2].

Anna's Archive describes its mission as "preserving humanity's knowledge and culture" and claims this Spotify scrape represents "the world's first 'preservation archive' for music which is fully open" [2]. The group stated: "With your help, humanity's musical heritage will be forever protected from destruction by natural disasters, wars, budget cuts, and other catastrophes" [1]. While theoretically anyone with technical knowledge could use the archive to create their own copy of Spotify's music catalog, such attempts would face swift legal action from record companies and other rightsholders.

AI Training Data Concerns Emerge

The most pressing concern for the creative community is that this massive dataset could be used to train AI music generators. Ed Newton-Rex, a composer and campaigner for protecting artists' copyright, warned that "training on pirated material is sadly common in the AI industry, so this stolen music is almost certain to end up training AI models" [1]. He emphasized the urgent need for data transparency, stating: "This is why governments must insist AI companies reveal the training data they use" [1].

Yoav Zimmerman, CEO of Third Chair, a company tracking unauthorized use of intellectual property, noted in a LinkedIn post that "it also just became dramatically easier for AI companies to train on modern music at scale" [2]. He added that members of the public could theoretically "create their own personal free version of Spotify," but emphasized that "the only thing stopping them is copyright law and the deterrent of enforcement" [1].

Copyright Infringement Battle Intensifies

The Anna's Archive site refers to LibGen, a vast online archive of pirated books that has allegedly been used by tech giants for AI development. According to a US court filing, Mark Zuckerberg's Meta used the LibGen dataset to train its AI models despite internal warnings that it is "a dataset we know to be pirated" [1]. This precedent raises concerns that the scraped Spotify files could follow a similar trajectory.

Copyright has become a battleground between the creative community and AI companies, whose tools, such as chatbots and music generators, are trained on vast amounts of data taken from the open web, including copyright-protected work. In the UK, creative professionals have protested against a government proposal to let AI companies use copyright-protected work without permission unless owners explicitly opt out. Liz Kendall, the secretary of state for science, innovation and technology, told parliament this month there was "no clear consensus" on the issue, with the government pledging to make policy proposals on AI and copyright by 18 March next year [1].

Spotify emphasized it is "actively working with industry partners to protect the rights of the creative community," stating: "Since day one, we have stood with the artist community against piracy" [2]. The incident underscores the urgent need for stronger protections as scraping techniques become more sophisticated and the appetite for AI training data grows among tech companies.

TheOutpost.ai

© 2025 Triveous Technologies Private Limited