Bluesky's Open API Sparks AI Training Controversy: One Million Posts Scraped Without User Consent

Curated by THEOUTPOST

On Thu, 28 Nov, 12:02 AM UTC

6 Sources

Bluesky, a decentralized social network, faces privacy concerns after one million public posts were scraped for AI training, highlighting the platform's vulnerability to data collection despite its stance against using user data for AI.

Bluesky's Data Scraping Incident

Bluesky, a decentralized social network gaining popularity as an alternative to X (formerly Twitter), has found itself at the center of a privacy controversy. A dataset of one million public posts from the platform was scraped and uploaded to the AI company Hugging Face for use in machine learning research 1. The incident has raised significant concerns about user privacy and consent in the age of AI.

The Dataset and Its Creator

The dataset was compiled by Daniel van Strien, a machine learning librarian at Hugging Face. It included not only the text of posts but also users' decentralized identifiers (DIDs) and metadata 2. Van Strien's intention was to use this data for research related to natural language processing, social media analysis, and content moderation.

Bluesky's Open API and Vulnerability

The core issue stems from Bluesky's open and decentralized nature, built on the Authenticated Transfer (AT) Protocol. The platform's Firehose API provides an aggregated stream of public data updates, which leaves public posts open to collection by external scrapers 3. This openness benefits third-party developers, but it also exposes users to a significant privacy risk.

User Consent and Platform Response

Bluesky users did not provide explicit permission for their posts to be used in this manner. The platform has stated that it will never train generative AI on user data, but it acknowledges that it cannot enforce this policy outside its systems 4. Bluesky is now exploring ways to enable users to communicate their consent preferences to external parties, though enforcement will ultimately depend on outside developers.
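
To make the enforcement point concrete, here is a minimal Python sketch of what honoring such preferences could look like on the side of an outside developer. It is an illustration only: Bluesky has not published a consent mechanism in the material covered here, so the `PublicPost` record, the `filter_by_consent` helper, the opt-out set, and the example DIDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class PublicPost:
    """Hypothetical, simplified view of a public post (not a real Bluesky schema)."""
    author_did: str  # decentralized identifier (DID) of the author
    text: str


def filter_by_consent(
    posts: Iterable[PublicPost],
    opted_out_dids: Set[str],
) -> Iterator[PublicPost]:
    """Yield only posts whose authors are not in the opt-out set.

    Assumption: the collector has obtained `opted_out_dids` from some
    consent signal. How that signal would be published is not specified.
    """
    for post in posts:
        if post.author_did not in opted_out_dids:
            yield post


# Made-up example data
posts = [
    PublicPost("did:plc:alice-example", "Hello from Alice"),
    PublicPost("did:plc:bob-example", "Hello from Bob"),
]
opted_out = {"did:plc:bob-example"}

kept = list(filter_by_consent(posts, opted_out))
```

In this sketch `kept` contains only the first post. The check runs entirely in the collector's own code, which is why any such preference signal depends on outside developers choosing to respect it.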

Aftermath and Apology

Following the backlash, van Strien removed the dataset from Hugging Face and issued a public apology, acknowledging that his data collection approach lacked transparency and consent 5. The incident has sparked a broader discussion about data privacy and the ethical use of public information for AI training.

Implications for Social Media Users

This controversy serves as a reminder that content shared publicly on platforms like Bluesky is accessible to external entities. As Bluesky continues to grow, surpassing 20 million users, it will likely face increasing scrutiny regarding its data protection measures and user privacy policies.

Broader Context of AI Training and User Data

The incident highlights a growing trend in the tech industry where user-generated content is being used to train AI models. Other platforms like X and Meta have updated their terms of service to allow for such practices, while LinkedIn has introduced an opt-out option for users who don't want their data used for AI training.

Future Considerations

As AI technology advances, the balance between open platforms, user privacy, and ethical AI training becomes increasingly complex. Bluesky's situation underscores the need for clearer policies, user consent mechanisms, and industry-wide standards for the responsible use of public data in AI development.
