Bluesky's Open API Sparks AI Training Controversy: One Million Posts Scraped Without User Consent


Bluesky, a decentralized social network, faces privacy concerns after one million public posts were scraped for AI training, highlighting the platform's vulnerability to data collection despite its stance against using user data for AI.


Bluesky's Data Scraping Incident

Bluesky, a decentralized social network gaining popularity as an alternative to X (formerly Twitter), has found itself at the center of a privacy controversy. A dataset of one million public posts scraped from the platform was uploaded to Hugging Face, an AI company, for use in machine learning research [1]. The incident has raised significant concerns about user privacy and consent in the age of AI.

The Dataset and Its Creator

The dataset was compiled by Daniel van Strien, a machine learning librarian at Hugging Face. It included not only the text of posts but also users' decentralized identifiers (DIDs) and metadata [2]. Van Strien intended the data for research on natural language processing, social media analysis, and content moderation.

Bluesky's Open API and Vulnerability

The core issue stems from Bluesky's open, decentralized design, which is built on the Authenticated Transfer (AT) Protocol. The platform's Firehose API provides an aggregated stream of public data updates, which makes that data easy for external scrapers to collect [3]. This openness benefits third-party developers but exposes users to a significant privacy risk.
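
To make the mechanism concrete, here is a minimal sketch of how any client reading a public event stream could accumulate post text alongside author identifiers. It does not connect to Bluesky: the event dictionaries, the `simulated_stream` and `collect_posts` helpers, and the sample DIDs are hypothetical stand-ins, and the real Firehose delivers encoded repository events rather than plain dictionaries. Only the record type names `app.bsky.feed.post` and `app.bsky.feed.like` come from the AT Protocol's published schemas.

```python
from typing import Iterable, Iterator

# Record type for posts in the AT Protocol's Bluesky schemas.
POST_COLLECTION = "app.bsky.feed.post"


def simulated_stream() -> Iterator[dict]:
    """Stand-in for a public event stream. The event shape is hypothetical."""
    yield {
        "did": "did:plc:alice-example",
        "collection": POST_COLLECTION,
        "record": {"text": "Hello, world", "createdAt": "2024-01-01T00:00:00Z"},
    }
    yield {
        "did": "did:plc:bob-example",
        "collection": "app.bsky.feed.like",  # not a post, so it is skipped
        "record": {},
    }
    yield {
        "did": "did:plc:carol-example",
        "collection": POST_COLLECTION,
        "record": {"text": "Second post", "createdAt": "2024-01-01T00:00:05Z"},
    }


def collect_posts(events: Iterable[dict], limit: int) -> list[dict]:
    """Keep up to `limit` post events, retaining author DID and metadata."""
    collected: list[dict] = []
    for event in events:
        if event.get("collection") != POST_COLLECTION:
            continue  # ignore likes, follows, and other record types
        record = event.get("record", {})
        collected.append(
            {
                "did": event["did"],
                "text": record.get("text", ""),
                "created_at": record.get("createdAt"),
            }
        )
        if len(collected) >= limit:
            break
    return collected


dataset = collect_posts(simulated_stream(), limit=2)
```

Nothing in this loop requires special access or the authors' involvement, which is the privacy risk the article describes: a stream that is open to third-party developers is equally open to anyone assembling a dataset.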

User Consent and Platform Response

Bluesky users did not give explicit permission for their posts to be used in this manner. The platform has stated that it will never train generative AI on user data, but it acknowledges that it cannot enforce this policy outside its own systems [4]. Bluesky is now exploring ways for users to communicate their consent preferences to external parties, though enforcement will ultimately depend on outside developers.
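
The dependence on outside developers can be illustrated with a short sketch: a consent signal only has an effect if the party consuming the data chooses to check it. The source does not say what form Bluesky's preference mechanism will take, so the `allow_ai_training` field, the `filter_by_consent` helper, and the sample data below are all hypothetical.

```python
def filter_by_consent(posts: list[dict], preferences: dict[str, dict]) -> list[dict]:
    """Keep only posts whose author has explicitly opted in.

    `preferences` maps an author DID to a settings dict. The
    `allow_ai_training` key is a hypothetical name, not a real Bluesky field.
    Authors with no recorded preference are treated as not consenting.
    """
    return [
        post
        for post in posts
        if preferences.get(post["did"], {}).get("allow_ai_training") is True
    ]


# Hypothetical sample data for illustration only.
posts = [
    {"did": "did:plc:alice-example", "text": "Hello, world"},
    {"did": "did:plc:bob-example", "text": "Please don't train on this"},
    {"did": "did:plc:carol-example", "text": "No preference set"},
]
preferences = {
    "did:plc:alice-example": {"allow_ai_training": True},
    "did:plc:bob-example": {"allow_ai_training": False},
}

usable = filter_by_consent(posts, preferences)
```

Treating a missing preference as a refusal is a design choice made for this sketch; a scraper that skips the check entirely would face no technical barrier, which is why the platform alone cannot enforce such a policy.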

Aftermath and Apology

Following the backlash, van Strien removed the dataset from Hugging Face and issued a public apology, acknowledging that his approach to data collection had breached principles of transparency and consent [5]. The incident has sparked a broader discussion about data privacy and the ethical use of public information for AI training.

Implications for Social Media Users

This controversy serves as a reminder that content shared publicly on platforms like Bluesky is accessible to external entities. As Bluesky continues to grow, having surpassed 20 million users, it will likely face increasing scrutiny of its data protection measures and user privacy policies.

Broader Context of AI Training and User Data

The incident highlights a growing trend in the tech industry where user-generated content is being used to train AI models. Other platforms like X and Meta have updated their terms of service to allow for such practices, while LinkedIn has introduced an opt-out option for users who don't want their data used for AI training.

Future Considerations

As AI technology advances, the balance between open platforms, user privacy, and ethical AI training becomes increasingly complex. Bluesky's situation underscores the need for clearer policies, user consent mechanisms, and industry-wide standards for the responsible use of public data in AI development.
