2 Sources
[1]
Training AI on Mastodon posts? That idea's extinct
Such rules could be tricky to enforce in the Fediverse, though

Mastodon is the latest platform to push back against AI training, updating its terms and conditions to ban the use of user content for large language models (LLMs). "We want to make it clear," the federated platform stated in an email to users, "that training LLMs on the data of Mastodon users on our instances is not permitted."

The announcement may feel like shutting the stable door after the horse has bolted, but it's still reassuring to know that users' rants on the platform, in theory, won't feed into the LLMs behind generative AI services.

To be fair, enforcing such restrictions on a platform that prides itself on decentralization and openness could prove difficult. The terms apply only to Mastodon's own instances, not the wider Fediverse. It's possible to deploy a file to block AI crawlers, but that relies on those behind the bots respecting it rather than invoking fair use.

Mastodon is not the only platform worried about its content being used for AI training. Another social media platform, Bluesky, recently said: "We do not use any of your content to train generative AI, and have no intention of doing so," but, as the service acknowledged, enforcing such a rule outside its own systems is challenging. As 2024 drew to a close, a million public posts from Bluesky's firehose API turned up in a training set.

Earlier in June, discussion forum Reddit sued Anthropic, an AI business, over allegations [complaint is here - PDF] that content generated by its users was scraped in violation of contractual terms and technical barriers. The suit did not cite examples of any alleged violations by Anthropic after July 2024. In 2024, Reddit signed a data-sharing deal with OpenAI. Earlier that year, it signed an AI training deal with Google, having begun charging companies to use its data-downloading API in 2023.
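The "file to block AI crawlers" mentioned above presumably refers to robots.txt. A minimal sketch of what a server operator might serve, assuming they want to turn away some widely documented AI crawler user agents (the tokens shown are ones the respective vendors publish; compliance is entirely voluntary on the crawler's part):

```text
# robots.txt — served at the site root, e.g. https://example.social/robots.txt
# Honored only by crawlers that choose to respect it.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers may browse normally
User-agent: *
Allow: /
```

As the article notes, this is advisory only: a scraper that ignores robots.txt, or identifies itself with a different user agent string, is unaffected.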
Mastodon's change highlights the concerns of users over how their data might be used, particularly on platforms that are, by their nature, as free and open as possible. The updates, including an increase in minimum age from 13 to 16, take effect from July 1. ®
[2]
Mastodon's New Terms Block AI Scraping, But Gaps Still Remain
Open-source social media platform Mastodon has changed its rules to block companies from using user posts to train AI systems. The new policy, which takes effect from July 1, specifically bans scraping data from its main server, Mastodon.social.

The updated terms of service now clearly say that automated tools like scrapers or data miners cannot collect information from the platform. This includes collecting data for purposes such as archiving or training large language models (LLMs), which power AI tools such as chatbots. The terms tell users: "You are prohibited from using the Instance (the main server) for the commission of harmful or illegal activities." Accordingly, Mastodon instructs that no one should attempt, or assist anyone else attempting, to access users' data from its main server. This essentially means that no one can collect data from the Mastodon.social server using bots or other automated tools, unless it is for normal internet browsing or human review.

Pushback Against AI Overreach

The change comes amid growing anger among users of various platforms about their public posts being used without permission to train AI systems. Similar concerns were recently raised on other networks like Bluesky, where a dataset of user posts was collected and uploaded for research purposes. AI data scraping has become a wider issue in the tech world, and major platforms such as Reddit have also taken action: in one high-profile case, Reddit filed a lawsuit against AI company Anthropic, accusing it of using Reddit content without permission to train its Claude chatbot.

Mastodon's updated policy may come as a relief to those who worry about how AI is vacuuming up the internet's content. However, this change only applies to Mastodon.social -- the main server run by Mastodon gGmbH, the non-profit entity behind the platform. Mastodon operates as a decentralised network made up of many independently managed servers.
Therefore, if other server administrators don't adopt similar rules, users across the broader "fediverse" could still face exposure.

Mastodon Raises Age Limit For Users

In addition to the scraping ban, Mastodon has also raised the minimum age for users worldwide from 13 to 16 years. The updated policy bars anyone under 16 from accessing the server and requires users who cannot legally accept the terms to have a parent or guardian accept them instead.

More Legal Protections, But With Limits

The platform also updated its terms to include stronger rules against hacking attempts and automated tools that can disrupt the service. If a user flouts these terms, Mastodon reserves the right to deny them access, and even to report them to law enforcement authorities. The updated terms also include an arbitration clause, meaning users who want to sue must go through a legal process in Germany. This could make it harder for users in other countries to challenge Mastodon in court. And while the platform gives users ownership of their own posts, it grants itself broad rights to use and share that content.

Not a Full Win for Users

While Mastodon's effort to push back against AI scraping is a welcome move for privacy-conscious users, the policy's limited scope -- applying to only one server -- leaves much of the "fediverse" unprotected. It also highlights how difficult it is for decentralised platforms to act as one unit against AI firms harvesting online data. Unless more servers adopt the same rules as Mastodon.social, AI companies may still gorge on user content to train their models, with or without users' consent.
Mastodon updates its terms to prohibit AI training on user content, but the decentralized nature of the platform poses challenges in enforcing these rules across the entire Fediverse.
In a significant move to protect user privacy, Mastodon, the open-source social media platform, has updated its terms and conditions to prohibit the use of user content for training large language models (LLMs). The new policy, set to take effect from July 1, 2025, specifically bans the scraping of data from its main server, Mastodon.social [1].
Source: The Register
"We want to make it clear that training LLMs on the data of Mastodon users on our instances is not permitted," the platform stated in an email to users [1]. This decision aligns Mastodon with other platforms like Bluesky, which have expressed similar intentions to protect user data from AI training [1].

While the move is welcomed by privacy advocates, the decentralized nature of Mastodon poses significant challenges to enforcing these rules across the entire Fediverse. The new terms apply only to Mastodon's own instances, not the wider network of independently managed servers [2].
Mastodon has acknowledged the difficulty of enforcing such restrictions on a platform that prides itself on decentralization and openness. While it's possible to deploy a file to block AI crawlers, the effectiveness of this measure relies on the compliance of those behind the bots [1].
Mastodon's policy change comes amid growing concerns about AI companies using public online content without permission to train their models. The issue has already sparked legal action, such as Reddit's lawsuit against Anthropic for allegedly scraping user-generated content in violation of contractual terms [1].

The debate around AI training data has intensified, with platforms like Reddit signing data-sharing deals with AI companies such as OpenAI and Google, while simultaneously suing others for unauthorized use of their data [1].
Along with the AI training ban, Mastodon has introduced other significant policy updates:

- The minimum age for users worldwide rises from 13 to 16, and users who cannot legally accept the terms must have a parent or guardian accept them instead [2].
- Stronger rules against hacking attempts and automated tools that could disrupt the platform, with violators facing loss of access and possible referral to law enforcement [2].
- An arbitration clause requiring disputes to be resolved through a legal process in Germany, which could make it harder for users in other countries to challenge Mastodon in court [2].
- Users retain ownership of their posts, but Mastodon grants itself broad rights to use and share that content [2].
While Mastodon's efforts to protect user data are commendable, the limited scope of the policy highlights the challenges faced by decentralized platforms in presenting a unified front against AI data harvesting. Unless more servers in the Fediverse adopt similar rules, AI companies may still have access to vast amounts of user-generated content [2].
As the debate over AI training data continues, Mastodon's policy change serves as a significant milestone in the ongoing struggle between open, decentralized platforms and the need for user data protection in the age of AI.
Summarized by Navi