On Thu, 29 Aug, 4:04 PM UTC
3 Sources
[1]
Major Sites Are Saying No to Apple's AI Scraping
This summer, Apple gave websites more control over whether the company could train its AI models on their data. Major publishers and platforms like The New York Times and Facebook have already opted out.

Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it. WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED's parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple's AI training.

The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they've become a conflict zone over intellectual property and the future of the web.

This new tool, Applebot-Extended, is an extension to Apple's web-crawling bot that specifically lets website owners tell Apple not to use their data for AI training. (Apple calls this "controlling data usage" in a blog post explaining how it works.) The original Applebot, announced in 2015, initially crawled the internet to power Apple's search products like Siri and Spotlight. Recently, though, Applebot's purpose has expanded: the data it collects can also be used to train the foundational models Apple created for its AI efforts.

Applebot-Extended is a way to respect publishers' rights, says Apple spokesperson Nadine Haija. It doesn't actually stop the original Applebot from crawling the website -- which would then impact how that website's content appeared in Apple search products -- but instead prevents that data from being used to train Apple's large language models and other generative AI projects. It is, in essence, a bot to customize how another bot works.

Publishers can block Applebot-Extended by updating a text file on their websites known as the Robots Exclusion Protocol, or robots.txt. This file has governed how bots go about scraping the web for decades -- and like the bots themselves, it is now at the center of a larger fight over how AI gets trained. Many publishers have already updated their robots.txt files to block AI bots from OpenAI, Anthropic, and other major AI players.

Robots.txt allows website owners to block or permit bots on a case-by-case basis. While there's no legal obligation for bots to adhere to what the text file says, compliance is a long-standing norm. (A norm that is sometimes ignored: earlier this year, a WIRED investigation revealed that the AI startup Perplexity was ignoring robots.txt and surreptitiously scraping websites.)

Applebot-Extended is so new that relatively few websites block it yet. Ontario, Canada-based AI-detection startup Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent -- predominantly news and media outlets -- were blocking Applebot-Extended. This week, the AI agent watchdog service Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites, finding that approximately 6 percent had the bot blocked. Taken together, these efforts suggest that the vast majority of website owners either don't object to Apple's AI training practices or are simply unaware of the option to block Applebot-Extended.
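To make the mechanism concrete, here is a minimal sketch of the robots.txt entries a publisher might add; the user-agent tokens are the ones Apple documents, while the blanket block on all paths is illustrative:

    # Opt out of Apple's AI training while leaving the original
    # Applebot free to index the site for Siri and Spotlight.
    User-agent: Applebot-Extended
    Disallow: /

    # An empty Disallow places no restriction on this agent.
    User-agent: Applebot
    Disallow:

Because robots.txt rules are matched per user agent, a site can turn away Applebot-Extended without affecting how its content appears in Apple's search products.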
[2]
Facebook, New York Times, and more refuse to let Apple Intelligence train on their data
Future expansions to Apple Intelligence may involve more AI partners and paid subscriptions.

Website owners have a simple mechanism to tell Apple Intelligence not to scrape their sites for training purposes, and reportedly major platforms like Facebook and The New York Times are using it.

Apple has been offering publishers millions of dollars for the right to scrape their sites, in contrast to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can simply declare, in a particular file, that it does not want to be scraped. That file is a simple text one called robots.txt, and according to Wired, a great many major publishers are choosing to use it to block Apple's AI training.

The robots.txt file is no technical barrier to scraping, nor even really a legal one, and some firms are known to ignore being blocked. Still, many news sites are reportedly blocking Apple Intelligence, including Facebook, Instagram, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and Condé Nast.

In Apple's case, Wired says that two main studies in the last week have shown that around 6% to 7% of high-traffic websites are blocking Apple's AI-training crawler, called Applebot-Extended. A further study by Ben Welsh, also undertaken in the last week, found that just over 25% of the sites checked are blocking it. The discrepancy comes down to which sets of high-traffic websites each study sampled.

The Welsh study, for comparison, found that OpenAI's bot is blocked by 53% of the news sites checked, and Google's equivalent, Google-Extended, by almost 43%.

Wired concludes that while sites might not care whether Apple Intelligence is scraping them, the major reason for the low blocking figures is that Apple's AI bot is too little known for firms to notice it. Yet Apple Intelligence is not exactly hiding in the dark, and Applebot-Extended is an extension of Applebot. That crawler was first spotted by sites in November 2014 and officially revealed by Apple in May 2015. So for roughly a decade, Applebot has been searching and scraping websites in order to power Siri and Spotlight searches.

Consequently, it's less likely that website owners haven't heard of Apple Intelligence, and more likely that they have heard of Apple making deals worth millions. While negotiations are continuing, or just conceivably might start, some sites are consciously blocking Apple Intelligence. That includes The New York Times, which is also suing OpenAI for copyright infringement over its AI scraping.

"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission," says the newspaper's Charlie Stadtlander. "Importantly, copyright law still applies whether or not technical blocking measures are in place."
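Sites that block several AI crawlers at once typically stack entries in the same robots.txt file. A sketch using the publicly documented user-agent tokens for the three bots compared in the Welsh study (the blanket block on all paths is illustrative):

    # Refuse AI-training crawlers from OpenAI, Google, and Apple.
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

Note that Google-Extended, like Applebot-Extended, is a control token rather than a separate crawler: blocking it withdraws content from AI training without removing the site from Google Search.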
[3]
Many of the biggest websites opted out of Apple Intelligence training
Generative AI systems are trained by letting them surf the web to scrape content. Apple allows publishers to opt out of its scraping, and a new report says that many of the biggest websites have specifically opted out of Apple Intelligence training. This includes both Facebook and Instagram, as well as many high-profile news and media sites like The New York Times and The Atlantic ...

Large language models like ChatGPT are trained by giving them access to millions of words of source material, ranging from news stories to user comments. In Apple's case, the company has for years been using Applebot to train Siri and surface Spotlight suggestions. More recently, the company has also been using Applebot to train Apple Intelligence.

The practice is controversial, as AIs are effectively using copyrighted material to generate their own versions of it. For more niche topics, where source material is scarce, they have even been found to regurgitate entire paragraphs with almost no changes made. But Apple does this in an ethical way, allowing publishers to opt out and screening out personal data (though it did get caught out by one third-party source). As the company puts it:

"We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control [...] We apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet."

Apple uses an Applebot-Extended tag to allow sites to opt out of AI training while still allowing search indexing - meaning that their pieces can still be included in Spotlight and Siri searches. Since opting out is done using a publicly accessible robots.txt file, it's easy to see which sites have done this. Wired checked a number of the biggest news and social media sites:

"WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED's parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple's AI training [...] In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended."

Applebot-Extended is a relatively new tag, so it's likely that more websites will opt out as awareness increases. Apple is believed to have struck deals with some media companies, paying a fee in return for the right to use their content for training. This is likely the motivation for at least some sites currently blocking Apple: holding out for a payment offer.

"A lot of the largest publishers in the world are clearly taking a strategic approach," says Originality AI founder Jon Gillham. "I think in some cases, there's a business strategy involved -- like, withholding the data until a partnership agreement is in place."
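Because the opt-out lives in a publicly accessible robots.txt file, anyone can reproduce the kind of survey Wired and Ben Welsh ran. A minimal sketch in Python using only the standard library (the site list here is a hypothetical sample, not the surveyed set):

    # check_applebot.py -- report which sites disallow Applebot-Extended
    from urllib.robotparser import RobotFileParser

    SITES = [
        "https://www.example-news-site.com",   # hypothetical
        "https://www.another-publisher.com",   # hypothetical
    ]

    for site in SITES:
        parser = RobotFileParser()
        parser.set_url(site + "/robots.txt")
        parser.read()  # fetch and parse the live robots.txt
        # can_fetch() returns False when the named user agent is disallowed
        blocked = not parser.can_fetch("Applebot-Extended", site + "/")
        print(f"{site}: {'blocks' if blocked else 'allows'} Applebot-Extended")

Welsh's 294-of-1,167 figure is essentially this kind of check run across a large list of news sites and tallied.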
Apple's efforts to train its AI models using web content are meeting opposition from prominent publishers. The company's web crawler, Applebot, has been increasingly active, raising concerns about data usage and copyright issues.
In a significant development for the tech industry, Apple's ambitious plans to enhance its artificial intelligence capabilities are facing resistance from major publishers. The Cupertino-based tech giant has been ramping up its web crawling activities through Applebot, its proprietary web crawler, in what appears to be an effort to gather data for training its AI models [1].
Applebot, which has been in operation since 2015, was initially used to improve Siri and Spotlight search results. However, recent observations indicate a substantial increase in its activity, suggesting a broader scope that likely includes data collection for AI training purposes [1]. This expanded role aligns with Apple's growing interest in artificial intelligence and machine learning technologies.
As news of Apple's intensified web crawling spread, several high-profile publishers have taken steps to prevent their content from being used in Apple's AI training processes. Notable names including Facebook, Instagram, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and Condé Nast have implemented measures to block Applebot-Extended from accessing their websites [2].
These publishers are utilizing robots.txt files, a standard method for instructing web crawlers on which parts of a website they are allowed to access. By modifying these files, they aim to exclude Applebot-Extended specifically, while still allowing access to other crawlers, including Apple's original search crawler [3].
The pushback from publishers poses a significant challenge to Apple's AI ambitions. Access to diverse, high-quality content is crucial for training robust AI models. With major news outlets restricting access, Apple may face limitations in developing competitive AI products, particularly in areas like natural language processing and content generation [2].
This situation reflects a growing tension in the tech industry between AI companies' need for training data and content creators' rights. Similar controversies have emerged with other AI initiatives, such as those by OpenAI and Google, highlighting the complex issues surrounding data usage, copyright, and fair compensation in the AI era [1].
Beyond describing Applebot-Extended as a way to respect publishers' rights, Apple has said little publicly about the publishers' actions or its specific plans for AI development. The company's next moves will be closely watched by the industry, as they could set precedents for how tech giants navigate the delicate balance between innovation and respect for content creators' rights [3].
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is shrinking the pool of available training data, potentially hampering the development of future AI models.
3 Sources
Apple and Salesforce have responded to allegations that they used YouTube videos without permission to train their AI models. Both companies deny these claims, stating that their AI systems were not trained on such content.
14 Sources
Major tech companies, including Apple, Nvidia, and Anthropic, are facing allegations of using thousands of YouTube videos to train their AI models without proper authorization, sparking controversy and frustration among content creators.
27 Sources
New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.
2 Sources
Freelancer.com CEO Matt Barrie alleges that AI company Anthropic engaged in unauthorized data scraping from the company's platform. The accusation raises questions about data ethics and the practices of AI companies in training their models.
2 Sources