Curated by THEOUTPOST
On Thu, 28 Nov, 12:02 AM UTC
6 Sources
[1]
Your Bluesky posts might be training AI
Bluesky is grappling with a significant privacy issue after one million public posts were scraped from its platform for AI training, according to a 404Media report. The dataset, compiled by Daniel van Strien, a machine learning librarian at the AI company Hugging Face, was intended for research on natural language processing and social media analysis. Although Bluesky's representatives assert that the platform will never train generative AI on user data, the open nature of its API leaves it exposed to external scrapers.

The dataset was sourced through Bluesky's Firehose API, which provides an aggregated stream of public data updates, including posts, likes, and follows. Van Strien intended the dataset to advance machine learning research, but it included not only the text of posts but also users' decentralized identifiers (DIDs) and metadata. After media reports highlighted the issue, the dataset was swiftly removed from Hugging Face amid backlash over user privacy and the lack of consent. Bluesky users did not give explicit permission for their posts to be used in this way, though Bluesky's policies do not categorically prohibit such collection.

The core of the controversy lies in the open structure of Bluesky's API, which allows third-party developers to access its public data freely. According to a statement from a Bluesky representative, "we'd like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this," indicating an effort to give users more control over data sharing in the future.

Following the removal of the dataset, van Strien acknowledged the breach of transparency and consent in his data collection approach. "I apologize for this mistake," he stated in a follow-up post on Bluesky. The incident is a reminder that any content shared publicly on the platform is accessible to external entities.

As the platform continues to grow -- recently surpassing 20 million users -- Bluesky will likely face increasing scrutiny of its data protection measures and user privacy. The company is discussing mechanisms that would let users express their consent preferences to third parties, but enforcement remains a challenge; as the platform notes, it will ultimately be up to outside developers to honor those preferences. Bluesky's representatives added that they are in discussions with engineers and legal teams but have no immediate solution to offer.
[2]
One million public Bluesky posts scraped for AI training
Bluesky is already facing its first major AI scrape, despite its owners' stance that the platform will never train generative AI on user data. As reported by 404Media on Nov. 26, one million public Bluesky posts -- complete with identifying user information -- were crawled and then uploaded to AI company Hugging Face.

The dataset was created by machine learning librarian Daniel van Strien and intended for use in the development of language models and natural language processing, as well as general analysis of social media trends, content moderation, and posting patterns. It contains users' decentralized identifiers (DIDs) and even has a search function to find content from specific users. According to the dataset's description, the set "contains 1 million public posts collected from Bluesky Social's firehose API (Application Programming Interface), intended for machine learning research and experimentation with social media data. Each post contains text content, metadata, and information about media attachments and reply relationships."

Bluesky users didn't opt in to such uses of their content, but neither is it expressly prohibited by Bluesky. The platform's firehose API is an "aggregated, chronological stream of all the public data updates as they happen in the network, including posts, likes, follows, handle changes, and more." Bluesky's API -- coupled with the public and decentralized Authenticated Transfer (AT) Protocol the site is built on -- means Bluesky content is open and available to the third-party developers the platform is trying to court, 404Media explains. This could be a major warning sign for many of the site's millions of new users, many of whom left competitor X in the wake of an alarming new AI training policy.

A Bluesky representative responded to 404Media's requests for comment: "Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here. We'd like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we're actively discussing how to achieve this."

Shortly after the article's publication, the dataset was removed from Hugging Face. "I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake," van Strien wrote in a follow-up Bluesky post.
[3]
Your Bluesky data is also used to train AI, even though the company claims not to do so - Softonic
A researcher published a dataset of scraped data from Bluesky users, but has since deleted it.

Bluesky, the decentralized social network, is at the center of controversy following the recent publication of a dataset on Hugging Face, a community platform for artificial intelligence. According to 404 Media, the dataset contained one million posts along with user information, obtained by researcher Daniel van Strien through scraping via the Firehose API.

Van Strien justified the use of the data as a way to "develop artificial intelligence models, analyze trends in social networks, and study posting patterns," although he ultimately deleted the dataset after acknowledging that "this approach violated the principles of transparency and consent in data collection." The dataset included sensitive metadata, such as users' decentralized identifiers (DIDs), and specific search tools, which raised concerns among many about possible misuse of the information.

Although Bluesky maintains that it does not train AI models on its users' data, it admits that it "cannot enforce this policy outside of our systems" and that the decision lies with external developers. The company also promises to keep working with engineers and lawyers to address the issue. Meanwhile, the open and decentralized nature of Bluesky, built on the Authenticated Transfer (AT) protocol, makes it easy for third parties to access public content -- an approach that contrasts with platforms like X (formerly Twitter), where Elon Musk restricted API access and raised its price, supposedly to curb indiscriminate scraping.
[4]
Bluesky's open API means anyone can scrape your data for AI training | TechCrunch
Per a report by 404 Media, a machine learning librarian at AI firm Hugging Face pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Daniel van Strien later removed the data amid the controversy that ensued; still, the episode is a timely reminder that everything you post publicly to Bluesky is, well, public.

Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, though it's up to those parties whether they respect those preferences. The company posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"

What's clear is that as Bluesky surges in popularity, its rapid rise to the forefront of the global consciousness means it will be subject to the same levels of scrutiny as other major social platforms.
[5]
Think Twice Before Joining Bluesky
Bluesky offers an open API, allowing its data to be used for training AI models.

Since election day in the US, Bluesky, a microblogging alternative to X (formerly Twitter), has been rapidly gaining popularity. Its user base has doubled since September, reaching 20 million by November 20. The platform is competing against Elon Musk's X, which has approximately 611 million monthly active users, and Meta's Threads, which boasts 275 million monthly active users. Musk's ownership of X and his close alliance with President-elect Donald Trump have made many users uncomfortable, which could be a reason people are leaving the platform. One report estimates that around 115,000 X accounts were deactivated in the US the day after the election. Trump is reportedly considering appointing an AI czar under Musk's guidance to oversee federal policies and the government's use of artificial intelligence.

Unlike X, however, Bluesky offers an open API, allowing its data to be used for training AI models. Daniel van Strien, a machine learning librarian at Hugging Face, recently released a dataset containing one million public posts sourced from Bluesky's Firehose API. The dataset included text, metadata, and language predictions. He quickly faced backlash over the lack of user consent. "Hi, I do not consent for my posts or content to be used for AI purposes in any way, shape, or form for ethical reasons. Can you withdraw my account from your data scraping, please?" posted one user on Bluesky. Another posted, "You've started a social trend of bad actors using the API to deliberately create antagonistic bsky datasets on Hugging Face (i.e., 'two-million-bluesky-posts' repo)." These were just a few of many such posts.

Van Strien eventually deleted the dataset and issued a public apology. "I've removed the Bluesky data from the repo. While my goal was to aid tool development for the platform, I understand this violated principles of transparency and consent. I sincerely apologise for this mistake," he said on Bluesky. With the situation escalating, Clem Delangue, CEO of Hugging Face, responded on X: "Surprisingly (or maybe not), it looks like there are a lot of toxic users on Bluesky. One of our team members made a mistake, and the reactions we're getting are just awful (but also funny tbh). Let's keep working on more positive public conversation spaces maybe?"

Bluesky itself, however, does not use user content to train its models. "A number of artists and creators have made their home on Bluesky, and we hear their concerns with other platforms training on their data. We do not use any of your content to train generative AI, and have no intention of doing so," the company said in a post. After the Hugging Face incident, Bluesky clarified that it is an open and public social network, much like websites on the internet itself. Websites, however, can specify whether they consent to outside companies crawling their data through a robots.txt file, and Bluesky said it is trying to introduce a similar practice.

On November 15, X updated its terms of service. The new terms state that when you upload content (like text, images, etc.), you permit X to use it for analysis, including to help train machine learning and artificial intelligence models. This change was one of the factors that led users to migrate to Bluesky. Interestingly, Musk's xAI is planning to launch its own Grok standalone app in December.
Similarly, Meta's updated privacy policy specifies that Meta trains its models using users' posts, photos, and captions. "We do not use the content of your private messages with friends and family to train our AIs unless you or someone in the chat chooses to share those messages with our AIs," the company states. Microsoft-owned LinkedIn recently introduced a new privacy setting that automatically enrols users in AI model training. On September 18, LinkedIn updated its privacy policy to state that user data could be used to develop and train AI models. However, users can opt out by going to the data privacy tab in their account settings and disabling the 'Data for Generative AI Improvement' toggle. This opt-out only applies to future data use and does not affect any training already conducted.

Does it Matter?

Startups like OpenAI and Anthropic have already exhausted human-generated content to train their models and now rely on synthetic data for their upcoming frontier models. However, asking for user consent when using their data is still essential, and there is no excuse for bypassing it. For instance, in India, Sarvam AI is using synthetic data created by Meta Llama 3.1 405B to train its model. OpenAI reportedly uses Strawberry (o1) to generate synthetic data for GPT-5. This sets up a 'recursive improvement cycle,' where each GPT version (e.g., GPT-5 or GPT-6) is trained on higher-quality synthetic data created by the previous model.
[6]
Bluesky dataset for AI training removed from Hugging Face
The creator of the dataset issued an apology to concerned users in a post on Bluesky.

A dataset of 1m Bluesky posts that was uploaded to machine learning platform Hugging Face earlier this week has been removed. On 26 November, Daniel van Strien, a machine learning librarian at Hugging Face, uploaded a dataset of 1m public posts and accompanying metadata taken from Bluesky's firehose API. The dataset card explained it was "intended for machine learning research and experimentation with social media data."

However, after facing a backlash, van Strien removed the Bluesky data and apologised yesterday (27 November). "I've removed the Bluesky data from the repo," van Strien posted on Bluesky. "While I wanted to support tool development for the platform, I recognise this approach violated principles of transparency and consent in data collection. I apologise for this mistake." He said that he has left the public repository (which the dataset was posted to) online so that users can continue to give feedback.

As noted by 404 Media, the data wasn't anonymous: each post was listed alongside the user's decentralised identifier. While many commentators said that data collection should be opt-in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use.

Discourse over data

There has been increasing discourse about the use of people's data for artificial intelligence (AI) training without users' consent. X, the social media site which rivals Bluesky, found itself in hot water earlier this year when a security expert said that Elon Musk's X is "overstepping boundaries of digital ownership" by defaulting users into allowing their posts, interactions and even conversations to be shared with its AI chatbot Grok for the purpose of AI development. A month later, the Irish Data Protection Commission (DPC) said that X had decided to suspend processing EU users' personal data to train Grok after the commission had taken legal action against it.

Meta, the parent company of WhatsApp, Facebook and Instagram, also faced complaints earlier in the year regarding its plans to use personal data for AI. Instead of asking users for their consent, Meta argued that it had a "legitimate interest" to collect and process this data. The company used the same legal basis for its personalised advertising policies, but that basis was rejected by the European Court of Justice last year.

Earlier this month, Bluesky experienced a surge of new users following a mass exodus from X, which even led to a brief outage for the site. Open-source champion Kelsey Hightower, best known for his work with Kubernetes and Google, spoke to SiliconRepublic.com about the promise of Bluesky as a decentralised platform. He said that we have been presented with a new opportunity to get social media right, but added that we all have a responsibility to ensure that this happens.
Bluesky, a decentralized social network, faces privacy concerns after one million public posts were scraped for AI training, highlighting the platform's vulnerability to data collection despite its stance against using user data for AI.
Bluesky, a decentralized social network gaining popularity as an alternative to X (formerly Twitter), has found itself at the center of a privacy controversy. A dataset containing one million public posts from the platform was scraped and uploaded to AI company Hugging Face, intended for machine learning research [1]. This incident has raised significant concerns about user privacy and consent in the age of AI.
The dataset was compiled by Daniel van Strien, a machine learning librarian at Hugging Face. It included not only the text of posts but also users' decentralized identifiers (DIDs) and metadata [2]. Van Strien's intention was to use this data for research related to natural language processing, social media analysis, and content moderation.
The core issue stems from Bluesky's open and decentralized nature, built on the Authenticated Transfer (AT) Protocol. The platform's Firehose API provides an aggregated stream of public data updates, making it easy for external scrapers to collect [3]. This openness, while beneficial for third-party developers, exposes users to a significant privacy risk.
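Because the firehose is public, reading it requires no special access or approval. The following is a minimal sketch of subscribing to it, assuming the community atproto Python SDK; the FirehoseSubscribeReposClient and parse_subscribe_repos_message names are taken from that SDK's documentation and should be verified before use. It illustrates how open the stream is and is not a reproduction of van Strien's collection code.

    # Minimal sketch: listening to Bluesky's public firehose with the community
    # atproto Python SDK (pip install atproto). API names assumed from its docs.
    from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message

    client = FirehoseSubscribeReposClient()

    def on_message(message) -> None:
        # Every frame describes a public repository event (post, like, follow, ...).
        event = parse_subscribe_repos_message(message)
        print(type(event).__name__)

    client.start(on_message)  # blocks and streams events as they happen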
Bluesky users did not provide explicit permission for their posts to be used in this manner. The platform has stated that it will never train generative AI on user data, but it acknowledges that it cannot enforce this policy outside its systems [4]. Bluesky is now exploring ways to enable users to communicate their consent preferences to external parties, though enforcement will ultimately depend on outside developers.
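Bluesky's own comparison to robots.txt hints at what such a consent signal could look like in practice: a machine-readable rule that well-behaved crawlers check before collecting data, but that cannot technically stop a crawler that ignores it. As a rough illustration, using only Python's standard library (the bot name and URLs below are made up for the example), a respectful scraper's check looks like this:

    # Sketch of a robots.txt-style consent check. A polite crawler reads the
    # site's rules before fetching; nothing forces an impolite one to comply.
    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()  # download and parse the site's crawl rules

    # False means the site has asked this bot not to fetch that path.
    print(rp.can_fetch("ExampleResearchBot", "https://example.com/posts/123"))

Any Bluesky-side consent setting would share the same limitation the company describes: it expresses a preference that outside developers may or may not honor.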
Following the backlash, van Strien removed the dataset from Hugging Face and issued a public apology, acknowledging the breach of transparency and consent in his data collection approach [5]. This incident has sparked a broader discussion about data privacy and the ethical use of public information for AI training.
This controversy serves as a reminder that content shared publicly on platforms like Bluesky is accessible to external entities. As Bluesky continues to grow, surpassing 20 million users, it will likely face increasing scrutiny regarding its data protection measures and user privacy policies.
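To make that concrete: Bluesky's public read APIs can be queried without an account or API key. The sketch below assumes the unauthenticated AppView endpoint public.api.bsky.app and the app.bsky.feed.getAuthorFeed method (verify both against the current Bluesky API documentation); it fetches a handful of public posts, each of which already carries the author's DID alongside the post text, the same kind of record that ended up in the scraped dataset.

    # Sketch: reading public Bluesky posts with no authentication at all.
    # Endpoint and response fields assumed from the Bluesky/AT Protocol docs.
    import requests

    APPVIEW = "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"

    resp = requests.get(APPVIEW, params={"actor": "bsky.app", "limit": 5}, timeout=10)
    resp.raise_for_status()

    for item in resp.json().get("feed", []):
        post = item["post"]
        # Each record exposes the author's DID and handle next to the text and metadata.
        print(post["author"]["did"], "-", post["record"].get("text", "")[:80])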
The incident highlights a growing trend in the tech industry where user-generated content is being used to train AI models. Other platforms like X and Meta have updated their terms of service to allow for such practices, while LinkedIn has introduced an opt-out option for users who don't want their data used for AI training.
As AI technology advances, the balance between open platforms, user privacy, and ethical AI training becomes increasingly complex. Bluesky's situation underscores the need for clearer policies, user consent mechanisms, and industry-wide standards for the responsible use of public data in AI development.
Bluesky, a rising social media platform, has announced it will not use user content to train generative AI models, setting itself apart from competitors like X (formerly Twitter) and attracting privacy-conscious users.
7 Sources
X, the social media platform formerly known as Twitter, has updated its privacy policy to allow third-party collaborators to use user data for AI training purposes, sparking debates about user privacy and data rights.
8 Sources
LinkedIn has stopped collecting UK users' data for AI training following regulatory scrutiny. This move highlights growing concerns over data privacy and the need for transparent AI practices in tech companies.
8 Sources
LinkedIn, with its 930 million users, is using member data to train AI models, sparking a debate on data privacy and the need for transparent opt-out options. This practice has raised concerns among privacy advocates and users alike.
4 Sources
Meta faces scrutiny from Australian authorities over its use of user data for AI training. The company has admitted to scraping posts and photos from Facebook users since 2007 without explicit consent, raising privacy concerns.
8 Sources