Cloudflare Unveils Tools to Combat AI Data Scraping, Empowering Website Owners

13 Sources

[1]

Cloudflare launches bot management tools to charge AI bots for data scraping

Cloudflare has launched a new suite of tools designed to give website owners and content creators more control over how their content is used by artificial intelligence. The AI Audit tools address the growing concern that bots are scanning websites without permission or compensation for creators. Cloudflare's ongoing efforts to stamp out AI bots follow earlier work to identify such unwanted scrapers via digital fingerprinting, designed to differentiate bots from legitimate web users. Bots themselves are not the issue: those used by search engines can provide value by indexing content and driving traffic to websites. AI, though, arguably makes it more challenging to eliminate only certain kinds of bots. AI bots used by large language models often scrape publicly available data to train models without attributing or crediting sources, and without giving compensation to creators. This can lead to creators finding their work, or close imitations of it, in AI-generated responses. Cloudflare's AI Audit was built to give website owners detailed analytics that offer transparency into which AI bots are accessing their sites, how often, and which parts. For example, it can differentiate bots like OpenAI's GPTBot and Anthropic's ClaudeBot. Cloudflare CEO Matthew Prince commented: "Content creators and website owners of all sizes deserve to own and have control over their content. If they don't, the quality of online information will deteriorate or be locked exclusively behind paywalls." As well as identifying bots, Cloudflare's tools offer controls and custom rules to permit or block AI services to eliminate scraping altogether or to align with deals that website owners may have struck up with certain AI companies.
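To make the analytics idea concrete, here is a minimal sketch of the kind of per-bot, per-section tally described above (which AI bots visit, how often, and which parts of a site). It is not Cloudflare's implementation: the `classify` and `audit` helpers, the request tuples, and the user-agent strings are all hypothetical, and it relies only on the self-declared user agent, which a crawler can spoof. Only the bot names GPTBot and ClaudeBot come from the article.

```python
from collections import Counter

# Bot names mentioned in the article; a real list would be much longer.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot")


def classify(user_agent):
    """Return the AI bot name found in a user-agent string, or None."""
    lowered = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def audit(requests):
    """Count requests per (bot, top-level site section).

    `requests` is an iterable of (user_agent, path) pairs, a
    hypothetical stand-in for parsed access-log entries.
    """
    counts = Counter()
    for user_agent, path in requests:
        bot = classify(user_agent)
        if bot is None:
            continue  # not a known AI bot; ignore for this tally
        section = "/" + path.strip("/").split("/")[0]
        counts[(bot, section)] += 1
    return counts


# Illustrative sample data (made-up user-agent strings and paths).
sample_requests = [
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", "/blog/post-1"),
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", "/blog/post-2"),
    ("Mozilla/5.0 (compatible; ClaudeBot/1.0)", "/docs/intro"),
    ("Mozilla/5.0 (Windows NT 10.0) Firefox/130.0", "/blog/post-1"),
]

report = audit(sample_requests)
for (bot, section), hits in sorted(report.items()):
    print(f"{bot:10s} {section:10s} {hits}")
```

Because user agents are self-reported, a tally like this only covers bots that identify themselves honestly, which is why the article's mention of digital fingerprinting matters for crawlers that do not.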

[2]

pcgamer

Cloudflare is allowing websites to block AI from scraping them and can even make bots pay for access

This can offer better safeguarding for site owners and content creators. If you own or run a website and don't want AI bots crawling through your work to train their datasets -- and who does? -- Cloudflare has launched the ability to "block all AI bots in one click", with even more interesting features coming down the pipeline in the future. According to a new Cloudflare blog, as spotted by Ars Technica, the cloud-based content delivery and management service has enabled a whole host of tools to better manage concerns around AI from the sites it supports. The new AI functions, labelled "AI Audit", are split into three main actions. The first is that Cloudflare users can control bots' access to websites. In the blog post, it said: "Many small sites don't have the skills or bandwidth to manually block AI bots. The ability to block all AI bots in one click puts content creators back in control." This new system can not only do a better job at differentiating between 'good' and 'bad' uses of AI bots but also gives site owners the ability to curate which bots they block. Secondly, site owners now have a better feed of analytics on AI bots, like being able to spot how many are crawling a site and which AI services draw on their pages. As a result of these in-depth analytics, users can better negotiate with AI bot owners for access to the site. Finally, though this isn't live with the rest of the tools, website owners will be able to directly sell access to AI crawlers. They can set prices for those accessing the site en masse, and get paid when their content is cited or trained on. The first two actions are available right now if you use Cloudflare, but the price setting is set to be supported at some point in the future, with the ability to sign up to a waiting list now. The Cloudflare CEO and co-founder Matthew Prince makes an interesting point in the blog for the inclusion of these AI monitoring tools.
As sites are increasingly worried about their work being lifted without credit, and then boosted above them in search results, Prince talks about the effects this can have on both the reader and creator. "Content creators and website owners of all sizes deserve to own and have control over their content," Prince says. "If they don't, the quality of online information will deteriorate or be locked exclusively behind paywalls." If only AI bots are paying for those paywalls, it might disincentivise the land grab effects of mass generative AI theft on the online landscape. AI Audit has only just rolled out but it could offer better protections for site owners, especially when the price-setting tool arrives.

[3]

Decrypt

New Relief for AI Bot Sufferers: Cloudflare's New Tool Lets Sites Charge For Data Scraping - Decrypt

San Francisco-based cloud services company Cloudflare launched a new set of AI tools Monday that aims to give websites the ability to stop unauthorized scraping by AI crawlers -- or to charge them for access to their data. "What we've previewed today is the ability for site owners and internet publications to say, 'this is the value I expect to receive from my site,'" Sam Rhea, a Cloudflare vice president, told Decrypt. "If you're an AI LLM and you want to scan this content or train against it, or make it part of your search result, this is the value I expect to receive for that." The free Cloudflare Bot Management platform allows websites to not only block AI bots but to charge a fee to as many bots as they approve, thereby getting revenue for the platforms feasting for free on their content. The AI Audit tool also gives users the ability to see how their content is being accessed. As Rhea explained, unlike malicious bots that try to crash websites or cut in line ahead of human customers attempting to access a website, AI crawlers don't aim to harm or steal but scan public content to train large language models. Sometimes those bots attribute the information back to the source, plausibly sending valuable traffic, Rhea said. "But other times, they take material, put it in a blender, and share it as if it were just part of a generic source, without any citation. That seems dangerous to me." Rhea said as far as Cloudflare, which provides security and performance optimization for websites, could tell, no single platform dominates website scraping activity, adding that it varies by the type of content being scraped at any given time. Generative AI models require large amounts of data to function and attempt to provide fast and accurate answers as well as create images, videos, and music. AI scrapers are a growing industry and include companies like LAION, Defined.AI, Aleph Alpha, and Replicate that provide AI developers with pre-collected text, voice, and image datasets.
According to market research firm Research Nester, the web scraping software industry is estimated to reach $2.45 billion by 2036. Last year, Ed Newton-Rex, the former head of audio at Stability AI, resigned over how AI platforms claimed that ingesting website data was "fair use." "'Fair use' wasn't designed with generative AI in mind -- training generative AI models in this way is, to me, wrong," he said. "Companies worth billions of dollars are, without permission, training generative AI models on creators' works, which are then being used to create new content that in many cases can compete with the original works." Newton-Rex added: "I don't see how this can be acceptable in a society that has set up the economics of the creative arts such that creators rely on copyright." Rhea said smaller AI developers seemed willing to pay to receive selected website content. "From the conversations we've had with foundational model providers and new entrants in the space, is that the kind of ocean of high-quality data is becoming difficult to find," he said, noting that scientific and mathematical content was especially in demand.

[4]

SiliconANGLE

Cloudflare debuts tools for website owners to charge AI companies that scrape their content - SiliconANGLE

Earlier this year, Cloudflare Inc. announced a simple tool for website owners to prevent artificial intelligence model developers from scraping their online content. Now, it's building on that with additional capabilities that can help website owners to control how their content is used by AI models, and even try to make money from it. The company said its AI Audit product provides a suite of tools to help customers understand how AI models are using their content. Once they know what their content is being used for, they'll then be able to decide if they're willing to let AI developers access it or not. Moreover, they'll also be able to set what they consider a "fair price" for AI scrapers to use their content for model training and other purposes. The practice of scraping websites for content has become extremely common in the AI industry, with the internet providing a treasure trove of ostensibly "free" data that can be used to train AI models. But this mass scraping of websites is controversial too, with many content creators and publishers arguing that it's unfair, especially since they're unaware it's happening. The biggest AI providers today are all guilty of scraping content from the web, including the likes of OpenAI, Google LLC, Meta Platforms Inc., Stability AI Ltd., IBM Corp. and Microsoft Corp. These companies all openly admit to helping themselves to publishers' content, arguing that the practice falls under the "fair use" doctrine. But critics say that it's having a detrimental impact on publishers, since they lose out on web traffic as a result of having their content scraped. For example, a website that posts food recipes will lose a ton of traffic - and potential revenue - to AI chatbots that use their content to quickly respond to requests for a recipe.
Because the chatbot provides the user with all of the information they've asked for, there's little incentive for anyone to actually go visit that website, even if the chatbot cites it as the source of its response. Some publishers have responded to this by taking steps to block AI developers from accessing their websites. Last month, the Guardian reported that The New York Times, CNN, Reuters and the Chicago Tribune had all blocked OpenAI's GPTBot web crawler from scanning their websites. Meanwhile, others have countered by enabling AI developers to access their content for a price. Reddit Inc., one of the world's busiest forums, said in April it is launching an application programming interface that will enable AI companies to pay to access its content, ensuring it is fairly compensated. With its latest update today, Cloudflare says, it's helping every website developer to do something similar. AI Audit is designed to give control back to content creators, so there can be a more transparent exchange between the two parties. It includes a simple, one-click tool that automatically prevents every kind of AI scraper from accessing their content, plus a suite of analytics tools that can help website owners to understand what AI bots are doing on their properties. According to Cloudflare, it can help site owners to understand why, when and how often AI models are accessing their web pages, and even make a distinction between AI bots that credit the source of their data and those that don't. In addition, Cloudflare's AI Audit also provides a tool for website owners to determine a fair price for allowing bots to access their content, based on the standard going rates negotiated by bigger publishers such as Reddit. Cloudflare says this is necessary because many smaller site owners lack the resources and expertise to understand the value of their content and negotiate deals with AI companies. 
Moreover, the AI companies themselves simply don't have the bandwidth to cut a deal with every single website they scrape, because there are millions of them. Cloudflare's AI Audit tab helps to define the metrics that are commonly used to establish a fair price for scraping, such as the rate of crawling for certain sections of content or for an entire page or website. Based on this data, it will then recommend a price and transaction flow. That enables AI developers to quickly find new sources of content and pay for them, compensating the creators. Cloudflare co-founder and Chief Executive Matthew Prince said AI will forever transform the way people interact with content online, so it's necessary for every stakeholder to get together and determine what this future will look like. But he believes it's important for content creators to be able to own and control their content. "If content creators don't have this control, the quality of online information will deteriorate or be locked exclusively behind paywalls," Prince said. "With Cloudflare's scale and global infrastructure, we believe we can provide the tools and set the standards to give websites, publishers, and content creators control and fair compensation for their contribution to the Internet, while still enabling AI model providers to innovate."

[5]

Dataconomy

Small websites fight back: Cloudflare's new tools against AI content scraping

In a significant move, Cloudflare has announced a new bundle of tools for online publishers. These tools are created to provide website owners with control over applying AI models to their content. The company intends to level the playing field for smaller publishers that commonly have their content taken without consent or pay in the form of AI-based scraping. New tools have been introduced that empower users to observe the actions of AI bots, which could lead to monetizing content access in a future marketplace. This endeavor represents a significant instant when digital content creators can achieve security and gain benefits from their efforts in artificial intelligence. The service delivers comprehensive analytics, revealing the periods and frequency of AI bots accessing websites. It also allows website owners to exclude or include specific bots with a simple click. The move has been adopted in response to rising concerns about how AI models affect smaller publishers. As rivals in the AI industry, such as OpenAI, Microsoft, and Meta, consistently mine the web for content to improve large language models (LLMs), many small websites contribute significant data but fail to gain traffic or revenue. This has raised the fear that the business models of smaller publishers might collapse should users choose AI-driven tools like ChatGPT over original website visits. Matthew Prince, who heads Cloudflare, emphasized the critical nature of fair payments to content creators. "If you don't compensate creators one way or another, then they stop creating, and that's the bit that has to get solved," Prince told TechCrunch. This declaration stresses the firm's mission of forming a more equitable digital ecosystem where content creators can have input on how their work is used. In furtherance of AI Audit, Cloudflare plans to launch a marketplace next year, permitting website owners to sell their content to providers of AI models. 
This platform will help smaller publishers negotiate agreements similar to those that major players like Reddit and Condé Nast have already finalized. The exact details of the marketplace are still being worked out. However, the concept is clear: content creators can charge AI bots for scraping their sites, either by setting a fee or by requesting proper attribution. The initiative addresses a key challenge in the AI era: making sure small publishers can endure and succeed despite the growth of generative AI. "We believe we can provide the tools and set the standards to give websites, publishers, and content creators control and fair compensation for their contribution to the Internet while enabling AI model providers to innovate," Prince said in a company blog post. Cloudflare's move comes amid escalating worries about AI-driven content scraping. A few months ago, publishers like The New York Times and CNN blocked OpenAI's GPTBot from gathering information from their sites. Some site owners have reported that intense data scraping has driven up service costs and degraded site performance, emphasizing the need for better controls. With these new tools, Cloudflare is addressing urgent issues for content creators and could reshape the long-term interplay between AI and content creation. To sustain a healthy digital ecosystem as AI advances, it is critical to balance technology development with fair compensation for creators.

[6]

The Register

Cloudflare reins in AI scraper bots with new Audit panel

Cloudflare on Monday expanded its defense against the dark arts of AI web scrapers by providing customers with a bit more visibility into, and control over, unwelcome content raids. The network biz earlier this year deployed a one-click AI bot defense to improve upon the not very effective robots.txt mechanism, a way websites can ask, but not require, bots to behave. Cloudflare is now upgrading its arsenal with an AI Audit control panel. The idea is to provide customers with analytics data about crawlers that harvest data for AI training and inference so better decisions can be made about whether to embrace the bots or turn them away. "Some customers have already made decisions to negotiate deals directly with AI companies," explained Sam Rhea, a member of Cloudflare's emerging technology and incubation team. "Many of those contracts include terms about the frequency of scanning and the type of content that can be accessed. We want those publishers to have the tools to measure the implementation of these deals." Rhea says the problem is that the emergence of AI bots has made it more complicated to determine whether programmatic access to a website is beneficial or abusive. While they're not conducting a denial of service attack, bots that capture site data to train AI models or serve AI search results can still present a business threat. "AI Data Scraper bots scan the content on your site to train new LLMs," said Rhea. "Your material is then put into a kind of blender, mixed up with other content, and used to answer questions from users without attribution or the need for users to visit your site." As software developer Simon Willison has described it, AI training is akin to "money laundering for copyrighted data." Because companies like OpenAI and Anthropic do not disclose the training data used to create their models, AI is essentially content laundering. It's similar to a crypto mixer - a process intended to disguise the provenance of cryptocurrency. 
Then, there are AI Search Crawler bots that scan content and cite it back in response to search queries. "The downside is that those users might just stay inside of that interface, rather than visit your site, because an answer is assembled on the page in front of them," said Rhea. That is to say, AI search may not drive traffic to source websites, and thus doesn't provide ad revenue. The issue came up over the summer when iFixit CEO Kyle Wiens objected to data harvesting by Anthropic's crawlers, a situation the AI firm has since addressed. Rhea argues that allowing AI bots to run rampant threatens the open internet. "Without the ability to control scanning and realize value, site owners will be discouraged to launch or maintain Internet properties," he said. "Creators will stash more of their content behind paywalls and the largest publishers will strike direct deals. AI model providers will in turn struggle to find and access the long tail of high-quality content on smaller sites." Enter Cloudflare's AI Audit control panel. The network biz believes companies can use the provided bot analytics to monitor content access deals with AI firms, which it claims are becoming more common, and enforce policies rather than trusting crawlers to obey robots.txt directives. ®

[7]

TechCrunch

Cloudflare's new marketplace lets websites charge AI bots for scraping

Cloudflare announced plans on Monday to launch a marketplace in the next year where website owners can sell AI model providers access to scrape their site's content. The marketplace is the final step of Cloudflare CEO Matthew Prince's larger plan to give publishers greater control over how and when AI bots scrape their websites. "If you don't compensate creators one way or another, then they stop creating, and that's the bit which has to get solved," said Prince in an interview with TechCrunch. As a means to get there, Cloudflare launched free observability tools for customers, called AI Audit, on Monday. Website owners will get a dashboard to view analytics on why, when, and how often AI models are crawling their sites for information. Cloudflare will also let customers block AI bots from their sites with the click of a button. Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial. A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare's tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site. Cloudflare is trying to address a problem looming over the AI industry: how will smaller publishers survive in the AI era if people go to ChatGPT instead of their website? Today, AI model providers scrape thousands of small websites for information that powers their LLMs. While some larger publishers have struck deals with OpenAI to license content, most websites get nothing, but their content is still fed into popular AI models on a daily basis. That could break the business models for many websites, reducing traffic they desperately need. 
Earlier this summer, AI-powered search startup Perplexity was accused of scraping websites that deliberately indicated they did not want to be crawled using the Robots Exclusion Protocol. Shortly after, Cloudflare released a button to ensure customers could block all AI bots with one click. "That was out of frustration we were hearing, where people were feeling like their content was being stolen," said Prince. Some website owners told Business Insider that AI bots were scraping their websites so much, it felt like a DDoS attack was crippling their servers. Having your website scraped can not only be upsetting, but it can literally run up your cloud bill and impact your service. But what if you wanted to block Perplexity's bots, but not OpenAI's? Prince tells TechCrunch that Cloudflare's customers are asking for tools that allow them to choose what AI models have access to their sites. Cloudflare's new tools launching today will allow customers to block some AI crawlers, while letting others through. Even large publishers that have struck licensing deals with OpenAI - such as TIME, Condé Nast, and The Atlantic - have relatively little insight into how much ChatGPT is scraping their websites, according to Prince. Many of them have to accept what OpenAI tells them, but the answer determines if the publishers are getting a good licensing deal or not. But Cloudflare's marketplace, launching sometime in the next year, aims to give small publishers the ability to strike deals with AI model providers as well. "Let's give all of you have the ability to do what only Reddit, Quora, and the big publishers of the world have done previously," said Prince. "What if we let you set, effectively, a price for accessing and taking your content to ingest into these systems." While it's a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.
Prince says websites could charge AI model providers based on the rates at which they're scraping individual websites, but it's unclear how much they will really pay. Further, he says websites could charge a monetary price to be scraped, or simply ask AI labs to give them credit. The details are fuzzy. While AI companies may not initially be excited about paying for content they currently get for free, Cloudflare's CEO says he thinks this is ultimately good for the AI ecosystem. Prince says the current landscape, where some AI companies don't pay for content ever, is not sustainable.

[8]

Wired

New Cloudflare Tools Let Sites Detect and Block AI Bots for Free

"The path we're on isn't sustainable," Cloudflare CEO Matthew Prince tells WIRED, in reference to rampant AI scraping. Here's his plan to course-correct. Internet infrastructure firm Cloudflare is launching a suite of tools that could help shift the power dynamic between AI companies and the websites they crawl for data. Today it's giving all of its customers -- including the estimated 33 million using its free services -- the ability to monitor and selectively block AI data-scraping bots. That preventative measure comes in the form of a suite of free AI auditing tools it calls Bot Management, the first of which allows real-time bot monitoring. Customers will have access to a dashboard showing which AI crawlers are visiting their websites and scraping data, including those attempting to camouflage their behavior. "We've labeled all the AI crawlers, even if they try to hide their identity," says Cloudflare cofounder and CEO Matthew Prince, who spoke to WIRED from the company's European headquarters in Lisbon, Portugal, where he's been based the past few months. Cloudflare has also rolled out an expanded bot-blocking service, which gives customers the option to block all known AI agents, or block some and allow others. Earlier this year, Cloudflare debuted a tool that allowed customers to block all known AI bots in one go; this new version offers more control to pick and choose which bots they want to block or permit. It's a chisel rather than a sledgehammer, increasingly useful as publishers and platforms strike deals with AI companies that allow bots to roam free. "We want to make it easy for anyone, regardless of their budget or their level of technical sophistication, to have control over how AI bots use their content," Prince says. Cloudflare labels bots according to their functions, so AI agents used to scrape training data are distinguished from AI agents pulling data for newer search products, like OpenAI's SearchGPT. 
Websites typically try to control how AI bots crawl their data by updating a text file called Robots Exclusion Protocol, or robots.txt. This file has governed how bots scrape the web for decades. It's not illegal to ignore robots.txt, but before the age of AI it was generally considered part of the web's social code to honor the instructions in the file. Since the influx of AI-scraping agents, many websites have attempted to curtail unwanted crawling by editing their robots.txt files. Services like the AI agent watchdog Dark Visitors offer tools to help website owners stay on top of the ever-increasing number of crawlers they might want to block, but they've been limited by a major loophole: unscrupulous companies tend to simply ignore or evade robots.txt commands.
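As a rough illustration of how robots.txt works in practice, the sketch below uses Python's standard `urllib.robotparser` module to check what a well-behaved crawler would conclude from a site's file. The robots.txt rules, the `example.com` URLs, and the `allowed` helper are hypothetical and chosen only for the example; GPTBot and ClaudeBot are the crawler names mentioned elsewhere in this roundup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, keep ClaudeBot out
# of one section, and allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /recipes/

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("GPTBot", "https://example.com/"),
    ("ClaudeBot", "https://example.com/recipes/soup"),
    ("ClaudeBot", "https://example.com/about"),
    ("SomeBrowser", "https://example.com/recipes/soup"),
]:
    verdict = "allowed" if allowed(agent, url) else "disallowed"
    print(f"{agent:12s} {url:40s} {verdict}")
```

The check runs on the crawler's side, which is the loophole the article describes: nothing in the file itself stops a bot that skips this step or misreports its user agent.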

[9]

Ars Technica

Cloudflare lets sites block AI crawlers with one click

Cloudflare may charge an app store-like fee for its AI-scraping data marketplace. Cloudflare announced new tools Monday that it claims will help end the era of endless AI scraping by giving all sites on its network the power to block bots in one click. That will help stop the firehose of unrestricted AI scraping, but, perhaps even more intriguing to content creators everywhere, Cloudflare says it will also make it easier to identify which content bots scan most, so that sites can eventually wall off access and charge bots to scrape their most valuable content. To pave the way for that future, Cloudflare is also creating a marketplace for all sites to negotiate content deals based on more granular AI audits of their sites. These tools, Cloudflare's blog said, give content creators "for the first time" ways "to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it." That's necessary for content creators because the rise of generative AI has made it harder to value their content, Cloudflare suggested in a longer blog explaining the tools. Previously, sites could distinguish between approving access to helpful bots that drive traffic, like search engine crawlers, and denying access to bad bots that try to take down sites or scrape sensitive or competitive data. But now, "Large Language Models (LLMs) and other generative tools created a murkier third category" of bots, Cloudflare said, that don't perfectly fit in either category. They don't "necessarily drive traffic" like a good bot, but they also don't try to steal sensitive data like a bad bot, so many site operators don't have a clear way to think about the "value exchange" of allowing AI scraping, Cloudflare said. That's a problem because enabling all scraping could hurt content creators in the long run, Cloudflare predicted.
"Many sites allowed these AI crawlers to scan their content because these crawlers, for the most part, looked like 'good' bots -- only for the result to mean less traffic to their site as their content is repackaged in AI-written answers," Cloudflare said. All this unrestricted AI scraping "poses a risk to an open Internet," Cloudflare warned, proposing that its tools could set a new industry standard for how content is scraped online.

How to block bots in one click

Increasingly, creators fighting to control what happens with their content have been pushed to either sue AI companies to block unwanted scraping, as The New York Times has, or put content behind paywalls, decreasing public access to information. While some big publishers have been striking content deals with AI companies to license content, Cloudflare is hoping new tools will help to level the playing field for everyone. That way, "there can be a transparent exchange between the websites that want greater control over their content, and the AI model providers that require fresh data sources, so that everyone benefits," Cloudflare said. Today, Cloudflare site operators can stop manually blocking each AI bot one by one and instead choose to "block all AI bots in one click," Cloudflare said. They can do this by visiting the Bots section under the Security tab of the Cloudflare dashboard, then clicking a blue link in the top-right corner "to configure how Cloudflare's proxy handles bot traffic," Cloudflare said. On that screen, operators can easily "toggle the button in the 'Block AI Scrapers and Crawlers' card to the 'On' position," blocking everything and giving content creators time to strategize what access they want to re-enable, if any. Beyond just blocking bots, operators can also conduct AI audits, quickly analyzing which sections of their sites are scanned most by which bots.
From there, operators can decide which scraping is allowed and use sophisticated controls to decide which bots can scrape which parts of their sites. "For some teams, the decision will be to allow the bots associated with AI search engines to scan their Internet properties because those tools can still drive traffic to the site," Cloudflare's blog explained. "Other organizations might sign deals with a specific model provider, and they want to allow any type of bot from that provider to access their content." For publishers already playing whack-a-mole with bots, a key perk would be if Cloudflare's tools allowed them to write rules to restrict certain bots that scrape sites for both "good" and "bad" purposes to keep the good and throw away the bad. Perhaps the most frustrating bot for publishers today is the Googlebot, which scrapes sites to populate search results as well as to train AI to generate Google search AI overviews that could negatively impact traffic to source sites by summarizing content. Publishers currently have no way of opting out of training models fueling Google's AI overviews without losing visibility in search results, and Cloudflare's tools won't be able to get publishers out of that uncomfortable position, Cloudflare CEO Matthew Prince confirmed to Ars. For any site operators tempted to toggle off all AI scraping, blocking the Googlebot from scraping and inadvertently causing dips in traffic may be a compelling reason not to use Cloudflare's one-click solution. However, Prince expects "that Google's practices over the long term won't be sustainable" and "that CloudFlare will be a part of getting Google and other folks that are like Google" to give creators "much more granular control over" how bots like the Googlebot scrape the web to train AI. 
Prince told Ars that while Google solves its "philosophical" internal question of whether the Googlebot's scraping is for search or for AI, a technical solution to block one bot from certain kinds of scraping will likely soon emerge. And in the meantime, "there can also be a legal solution" that "can rely on contract law" based on improving sites' terms of service. Not every site would, of course, be able to afford a lawsuit to challenge AI scraping, but to help creators better defend themselves, Cloudflare drafted "model terms of use that every content creator can add to their sites to legally protect their rights as sites gain more control over AI scraping." With these terms, sites could perhaps more easily dispute any restricted scraping discovered through Cloudflare's analytics tools. "One way or another, Google is going to get forced to be more fine-grained here," Prince predicted.
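
The audit idea described above, seeing which sections of a site are scanned most by which bots, can be illustrated with a minimal sketch. It is not Cloudflare's implementation: the function names, the list of user-agent tokens, and the `(user_agent, path)` input format are all assumptions made for illustration, and real bot detection relies on more than the self-reported user agent.

```python
from collections import Counter

# Assumed list of AI crawler user-agent tokens; a real audit would use a
# maintained list and stronger signals than the User-Agent header.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def section_of(path: str) -> str:
    """Map '/blog/post-1' to '/blog'; the root path maps to '/'."""
    parts = [p for p in path.split("/") if p]
    return "/" + parts[0] if parts else "/"


def audit(requests):
    """Count hits per (bot, section) from (user_agent, path) pairs."""
    counts = Counter()
    for user_agent, path in requests:
        ua = user_agent.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                counts[(token, section_of(path))] += 1
                break  # attribute each request to at most one bot
    return counts
```

Calling `audit(...)` on parsed access-log entries and then `most_common()` on the result would list the most heavily crawled bot and section pairs first.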

[10]

ZDNet

Cloudflare's new AI Audit tool aims to give content creators better bot controls

Some people have told me recently that artificial intelligence (AI) is writing my stories. As if! However, there is a reason it might seem as if AI has provided a helping hand. I've published close to 15,000 stories, five books, and hundreds of white papers. Put all that content together, and I've published about 12 million words. Of those, I'd guesstimate that half a million words have been about Linux and open source. Now, networking company Cloudflare has a solution for people who want to know how their content is used: AI Audit. The tool provides website owners with features to analyze and control how AI bots interact with their content. AI Audit offers detailed analytics that enable website owners to see which AI bots are accessing their sites, how often, and which parts. This visibility helps content creators make informed decisions about how they want their content to be used by AI models. The tool provides a simple yet powerful control mechanism. With one click, you can block all AI bots. This feature gives you immediate control, so you can take the time to figure out what the bots are doing to your traffic and business. For more granular control, AI Audit offers: The AI Audit tab will be accessible to existing customers through the Cloudflare dashboard. The tool integrates with Cloudflare's global infrastructure, leveraging the company's scale to provide these auditing features across the internet. To give it a try, you can join the AI Value Tool Waitlist. I'll be joining. Cloudflare will share further technical and practical details at its first Builder Day Live Stream on September 26 at 11 AM PT. Looking ahead, Cloudflare is developing additional features. These will include a pricing mechanism that allows you to set fair prices for AI companies to access your content for training and retrieval augmented generation. Eventually, the tool should create a seamless transaction flow between you and the AI companies. 
After all, why should they get all the billions? As Matthew Prince, Cloudflare's co-founder and CEO, said in a statement: "AI will dramatically change content online, and we must all decide together what its future will look like. Content creators and website owners of all sizes deserve to own and have control over their content. If they don't, the quality of online information will deteriorate or be locked exclusively behind paywalls." Prince is right. It's not just our websites, stories, art, and videos -- everything you've ever put up on the web will be, if it's not already, sucked into an AI vacuum. For example, LinkedIn now consumes your personal data by default to train AIs. You can stop LinkedIn from vacuuming your data, but you must manually block the company. It's not just LinkedIn that's vacuuming data. Meta's been doing it for over a year now. This approach is all by design, and the businesses don't make it easy to opt out. In fact, they make it as hard as possible. "Most companies add the friction because they know that people aren't going to go looking for it," said Thorin Klosowski, an Electronic Frontier Foundation security and privacy activist, to Wired. This situation is why I hope Cloudflare is successful. The tool won't help people on social networks, but it can protect your work if you have a blog or something similar. The tool will also help smaller websites. While major media companies, such as Condé Nast, sell whatever you've shared on Reddit to OpenAI, and can -- we presume -- make real money by selling content, smaller sites have no leverage to make a deal. Until, just possibly, now. Cloudflare is a major internet power. 
By making it simple to stop bots in their tracks and setting up a mechanism to make AI companies pay for your content, you may profit from your stories, art, and music without needing to be a privacy programming expert and a savvy business negotiator. I, for one, won't get rich, but it would be nice if I got something tangible from the AI companies using my work, besides rude notes from the clueless dweebs who accuse me of generating content automatically.

[11]

Fortune

Cloudflare is arming content creators with free weapons in the battle against AI bot crawlers

Artificial Intelligence companies eager for training data have forced many websites and content creators into a relentless game of whack-a-mole, battling increasingly aggressive web crawler bots that continuously scrape their data to train AI models. In just one example, repair database iFixIt complained in July that a web crawler bot for Anthropic's AI chatbot Claude hit its website nearly a million times in a single day. Of course, bot crawlers have been around for decades, either for good (to gather data for search engines that help people discover sites) or bad (malicious bots seeking to take down websites). The bots crawling for AI training data have fallen into a murky third category -- a website might want to block them all, or to allow some access to scrape data as part of licensing agreements or in the hopes of being cited in a chatbot answer. This summer, Cloudflare -- which, as one of the world's largest networks underlying the global internet, has a long history of offering services to block malicious bots -- began arming content creators with what it called the equivalent of a free "easy button" to block all website crawlers with one click. However, while it was useful, the feature was also a blunt instrument, Cloudflare CEO Matthew Prince tells Fortune. It could not differentiate between crawlers scavenging for AI training data and those crawling for search engines. In addition, customers could not decide to block one crawler but not another. "People didn't know whether to push the button or not," he said. Today, the company has added to its cadre of weapons with what it says are more precise tools that offer websites and content creators more control over who can access their data, as well as the ability to analyze how their content is used by AI models. 
Now a website can use new filters that give OpenAI permission to crawl its website, but not Baidu or Perplexity, and it can also control which areas of the website an AI company is permitted to access. Cloudflare maintains that its analytics can also help those signing licensing agreements with model providers understand the metrics used in negotiations, such as the rate for crawling certain sections or the entire page. Once the 40 million websites that use Cloudflare begin taking advantage of the new features, the company also hopes to become a central marketplace for them to negotiate with AI model providers (who also use Cloudflare) to license their data. Site owners could set a price for their site, or sections of their site, and then charge model providers. Prince says Cloudflare is uniquely positioned to act as the go-between. "When we say, listen, we're going to set these rules, that's something that AI companies pay attention to, because it immediately has an impact on north of 20% of the web," said Prince. Cloudflare's relationships with the major AI companies, he explained, creates a two-sided market. Cloudflare's efforts, he added, are essential for the open internet to continue because without the ability to control how sites are crawled by AI companies seeking to train models, content creators will either stop creating or put more of their content behind paywalls. While large publishers may strike direct deals, the AI model providers will struggle to access high-quality content from smaller websites. "I believe Cloudflare will be the company that is able to solve what I think is the key problem to make sure that content continues to be created online in a new, increasingly AI-powered web," said Prince.
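
The per-crawler, per-section filtering described above can be pictured with a small sketch. This is an illustration under stated assumptions, not Cloudflare's rule engine: the `POLICY` table and `is_allowed` function are hypothetical, the policy values are invented, and matching on the self-reported user agent alone is easy to spoof, which is why real bot management uses additional signals.

```python
# Hypothetical policy: crawler user-agent token -> path prefixes it may fetch.
# The tokens are user-agent names these crawlers identify themselves with;
# the allow/deny choices below are invented for illustration.
POLICY = {
    "GPTBot": ["/"],           # OpenAI: whole site allowed
    "ClaudeBot": ["/blog/"],   # Anthropic: only the blog section
    "PerplexityBot": [],       # blocked everywhere
    "Baiduspider": [],         # blocked everywhere
}


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if the request should be served under POLICY."""
    ua = user_agent.lower()
    for token, prefixes in POLICY.items():
        if token.lower() in ua:
            # A listed crawler is served only under its allowed prefixes.
            return any(path.startswith(prefix) for prefix in prefixes)
    # Not a listed crawler: leave the request to normal handling.
    return True
```

An empty prefix list blocks a crawler everywhere, while `["/"]` allows the whole site, so the same table covers both the "block one provider" and "allow only certain sections" cases.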

[12]

TechCrunch

Cloudflare's new marketplace will let websites charge AI bots for scraping

[13]

CXOToday.com

Cloudflare Helps Content Creators Regain Control of their Content from AI Bots

With new tools to automatically control how AI bots can access content, Cloudflare is the first to stand up for content creators at scale

Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced AI Audit, a set of tools to help websites of any size analyze and control how their content is used by artificial intelligence (AI) models. For the first time, website and content creators will be able to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it. Additionally, Cloudflare is developing a new feature where content creators can reliably set a fair price for their content that is used by AI companies for model training and retrieval augmented generation (RAG).

Website owners, whether for-profit companies, media and news publications, or small personal sites, may be surprised to learn AI bots of all types are scanning their content thousands of times every day without the content creator knowing or being compensated, causing significant destruction of value for businesses large and small. Even when website owners are aware of how AI bots are using their content, they lack a sophisticated way to determine what scanning to allow and a simple way to take action. For society to continue to benefit from the depth and diversity of content on the Internet, content creators need the tools to take back control.

"AI will dramatically change content online, and we must all decide together what its future will look like," said Matthew Prince, co-founder and CEO, Cloudflare. "Content creators and website owners of all sizes deserve to own and have control over their content. If they don't, the quality of online information will deteriorate or be locked exclusively behind paywalls. With Cloudflare's scale and global infrastructure, we believe we can provide the tools and set the standards to give websites, publishers, and content creators control and fair compensation for their contribution to the Internet, while still enabling AI model providers to innovate."

With AI Audit, Cloudflare aims to give content creators information and take back control so there can be a transparent exchange between the websites that want greater control over their content, and the AI model providers that are in need of fresh data sources, so that everyone benefits. With this announcement, Cloudflare aims to help any website:

Automatically control AI bots, for free: AI is a quickly evolving space, and many website owners need time to understand and analyze how AI bots are affecting their traffic or business. Many small sites don't have the skills or bandwidth to manually block AI bots. The ability to block all AI bots in one click puts content creators back in control.
Tap into analytics to see how AI bots access their content: Every site using Cloudflare now has access to analytics to understand why, when, and how often AI models access their website. Website owners can now make a distinction between bots - for example, text generative bots that still credit the source of the data they use when generating a response, versus bots that scrape data with no attribution or credit.
Better protect their rights when negotiating with model providers: An increasing number of sites are signing agreements directly with model providers to license the training and retrieval of content in exchange for payment. Cloudflare's AI Audit tab will provide advanced analytics to understand metrics that are commonly used in these negotiations, like the rate of crawling for certain sections or the entire page. Cloudflare will also model terms of use that every content creator can add to their sites to legally protect their rights.
Set a fair price for the right to scan content and transact seamlessly (in development): Many site owners, whether they are the large companies of the future or a high quality individual blog, do not have the resources, context, or expertise to negotiate one-off deals that larger publishers are signing with AI model providers, and AI model providers do not have the bandwidth to do this with every site that approaches them. In the future, even the largest content creators will benefit from Cloudflare's seamless price setting and transaction flow, making it easy for model providers to find fresh content to scan they may otherwise be blocked from, and content providers to take control and be paid for the value they create.

Existing Cloudflare customers can access the AI Tab from their dashboard today to review analytics for their sites and start controlling bots now. Site owners can visit https://www.cloudflare.com/lp/ai-value-tool-waitlist/ to join a waitlist to participate in the beta for price setting capabilities.

About Cloudflare

Cloudflare, Inc. (NYSE: NET) is the leading connectivity cloud company on a mission to help build a better Internet. It empowers organizations to make their employees, applications and networks faster and more secure everywhere, while reducing complexity and cost. Cloudflare's connectivity cloud delivers the most full-featured, unified platform of cloud-native products and developer tools, so any organization can gain the control they need to work, develop, and accelerate their business.


Cloudflare introduces new bot management tools allowing website owners to control AI data scraping. The tools enable blocking, charging, or setting conditions for AI bots accessing content, potentially reshaping the landscape of web data collection.

Cloudflare's New Bot Management Tools

Cloudflare, a leading internet security and performance company, has launched a suite of bot management tools designed to give website owners unprecedented control over how artificial intelligence (AI) bots interact with their content [1]. This move comes in response to the growing concerns about large-scale data scraping by AI companies for training their models.

Features of the New Tools

The new tools offer website owners several options to manage AI bot access:

Blocking: Completely prevent AI bots from accessing the site.
Charging: Implement a paywall for AI bots to access content.
Conditional Access: Set specific terms for AI bots to follow when scraping data [2].

These features aim to empower content creators and website owners to protect their intellectual property and potentially monetize their data.

Implications for AI Companies and Content Creators

The introduction of these tools could significantly impact how AI companies gather training data. Large tech firms like OpenAI, Anthropic, and Google, which rely on web scraping for AI model training, may face new challenges in accessing data [3].

For content creators and smaller websites, this development offers a way to assert control over their content and potentially benefit from its use in AI training [5].

Technical Implementation

Cloudflare's system uses machine learning to identify AI bot behavior and distinguish it from regular user traffic. Website owners can customize their preferences through Cloudflare's dashboard, setting specific rules for different types of bots [4].

Industry Reactions and Future Outlook

The move has been met with mixed reactions. While many content creators welcome the ability to protect their work, some argue that open access to information is crucial for AI advancement. This development may lead to negotiations between AI companies and content providers, potentially establishing new norms for data usage in AI training.

As the AI industry continues to evolve, Cloudflare's tools represent a significant shift in the dynamics of web data collection. The long-term effects on AI development, content creation, and internet accessibility remain to be seen, but it's clear that the landscape of AI training data acquisition is changing rapidly.

References

Summarized by

Navi

[1]

TechRadar

Cloudflare launches bot management tools to charge AI bots for data scraping

[2]

pcgamer

Cloudflare is allowing websites to block AI from scraping them and can even make bots pay for access

[3]

Decrypt

New Relief for AI Bot Sufferers: Cloudflare's New Tool Lets Sites Charge For Data Scraping

[4]

SiliconANGLE

Cloudflare debuts tools for website owners to charge AI companies that scrape their content

[5]

Dataconomy

Small websites fight back: Cloudflare's new tools against AI content scraping
