5 Sources
[1]
Creative Commons announces tentative support for AI 'pay-to-crawl' systems | TechCrunch
After announcing earlier this year a framework for an open AI ecosystem, the nonprofit Creative Commons has come out in favor of "pay-to-crawl" technology -- a system to automate compensation of website content when accessed by machines, like AI webcrawlers. Creative Commons (CC) is best known for spearheading the licensing movement that allows creators to share their works while retaining copyright. In July, the organization announced a plan to provide a legal and technical framework for dataset sharing between companies that control the data and the AI providers that want to train on it. Now, the nonprofit is tentatively backing pay-to-crawl systems, saying it is "cautiously supportive." "Implemented responsibly, pay-to-crawl could represent a way for websites to sustain the creation and sharing of their content, and manage substitutive uses, keeping content publicly accessible where it might otherwise not be shared or would disappear behind even more restrictive paywalls," a CC blog post said. Spearheaded by companies like Cloudflare, the idea behind pay-to-crawl would be to charge AI bots every time they scrape a site to collect its content for model training and updates. In the past, websites freely allowed webcrawlers to index their content for inclusion into search engines like Google. They benefited from this arrangement by seeing their sites listed in search results, which drove visitors and clicks. With AI technology, however, the dynamic has shifted. After a consumer gets their answer via an AI chatbot, they're unlikely to click through to the source. This shift has already been devastating for publishers by killing search traffic, and it shows no sign of letting up. A pay-to-crawl system, on the other hand, could help publishers recover from the hit AI has had on their bottom line. Plus, it could work better for smaller web publishers that don't have the pull to negotiate one-off content deals with AI providers. Major deals have been struck between companies like OpenAI and Condé Nast, Axel Springer and others; as well as between Perplexity and Gannett; Amazon and The New York Times; and Meta and various media publishers, among others. CC offered several caveats to its support for pay-to-crawl, noting that such systems could concentrate power on the web. It could also potentially block access to content for "researchers, nonprofits, cultural heritage institutions, educators, and other actors working in the public interest." It suggested a series of principles for responsible pay-to-crawl, including not making pay-to-crawl a default setting for all websites and avoiding blanket rules for the web. In addition, it said that pay-to-crawl systems should allow for throttling, not just blocking, and should preserve public interest access. They should also be open, interoperable, and built with standardized components. Cloudflare isn't the only company investing in the pay-to-crawl space. Microsoft is also building an AI marketplace for publishers, and smaller startups like ProRata.ai and TollBit have started to do so, as well. Another group called the RSL Collective announced its own spec for a new standard called Really Simple Licensing (RSL) that would dictate what parts of a website crawlers could access but would stop short of actually blocking the crawlers. Cloudflare, Akamai, and Fastly have since adopted RSL, which is backed by Yahoo, Ziff Davis, O'Reilly Media, and others.
[2]
A pay-to-scrape AI licensing standard is now official
An open licensing standard that aims to make AI companies pay for the content they vacuum up across the web is now an official specification. Really Simple Licensing 1.0 -- or RSL for short -- gives publishers the ability to dictate licensing and compensation rules to the web crawlers that visit their sites. The RSL Collective announced the standard in September with backing from Yahoo, Ziff Davis, and O'Reilly Media. It's an expansion of the robots.txt file, which outlines the parts of a website a web crawler can access. Though RSL alone can't block AI scrapers that don't pay for a license, the web infrastructure providers that support the standard can -- a list that now includes Cloudflare and Akamai, in addition to Fastly. RSL's 1.0 release lets publishers block their content from AI-powered search features, like Google's AI Mode, while maintaining a presence in traditional search results. Currently, Google doesn't give websites an individual option to opt out of AI-powered features without booting them out of traditional search, too. "RSL provides exactly that missing layer," RSL Collective cofounders Doug Leeds and Eckart Walther say in an emailed statement to The Verge. "Using RSL, Google can respect a publisher's preferences at the use case level, which means a publisher can stay fully available in traditional search, while opting out of AI training, grounding, or generative answers." Google is currently facing an investigation from the European Commission, which is looking into whether the company has violated antitrust policies by using web publishers' content in AI search features "without offering them the possibility to refuse such use of their content." The RSL Collective says more than 1,500 media organizations and brands now support RSL. In addition to Reddit, Quora, WikiHow, Stack Overflow, and Medium, publishers like The Associated Press, Vox Media (The Verge's parent company), The Guardian, Slate, BuzzFeed, and Men's Journal publisher Arena Group have also endorsed the standard. "With this release, and the support for it across the internet ecosystem, RSL 1.0 becomes the expected and trusted way to communicate how content may be used in AI systems, giving those signals real weight in both practice and legal interpretation," Leeds says. The RSL Collective also worked with the Creative Commons to add a new "contribution" payment option for nonprofit organizations and individuals behind the webpages, code repositories, and datasets that make up "the shared pool of freely available knowledge and creative work on the internet."
[3]
AI Platforms Are Paying (Some) Big Publishers, Leaving Smaller Ones Behind
An ideologically wide range of news outlets now stand to make some money off Meta's obsession with AI. CNN, Fox News, USA Today, The Daily Caller, People, Le Monde, and others have signed on to bring "real-time content on Meta AI." Partnering means paying; Meta plans to compensate those publishers an undisclosed amount, Axios media reporter Sara Fischer confirms. It's the latest in a series of moves by the operators of AI services to pay sites for access to their content.
A tracker of AI deals maintained by Columbia Journalism School's Tow Center for Digital Journalism lists 128 such arrangements between AI operators and news publishers since July 2023, including such high-profile tie-ups as OpenAI's deal with the Financial Times and Perplexity paying the Washington Post, the Los Angeles Times, and other publishers for inclusion in its Comet browser's premium service. (Tow's tracker also counts 21 lawsuits filed by publishers against AI providers in that time, including the lawsuit PCMag's parent company Ziff Davis filed against OpenAI in April 2025 alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
But all of these deals, plus similar ones with non-news sites like the content-licensing contracts Google and OpenAI inked with Reddit in 2024, have one unfortunate thing in common: They leave out smaller sites that can't afford lawyers to negotiate with the likes of Google and Meta. And small and large sites seem equally exposed to the risk of AI-enhanced search results giving web users enough information to save them from having to click through to a search result. In a study published this summer, the Pew Research Center found that Google's AI Overview search results diminished the clickthrough rate among survey respondents from 15% to 8%. Google has repeatedly said that it's not seeing an overall drop in clickthrough traffic and that AI Overview sends sites a little more "high-quality" clicks, meaning ones that result in more time spent at the site. It has yet to publish numbers documenting that second claim.
Court rulings have not yielded a legal consensus about how much an AI platform should be able to reuse the work of humans. In February, one federal judge ruled that a now-defunct AI startup infringed Thomson Reuters' copyrights when it leveraged content from that firm's Westlaw reference to create a competing service. In June, another ruled that Anthropic buying books and scanning them to train its Claude AI platform met fair-use criteria, but Anthropic downloading copies of books from a trove of pirated works did not.
The crawlers that read sites to provide data for training AI models can also impose bandwidth costs on those sites. In April, Wikipedia warned that an onslaught of these AI bots -- largely "automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models" -- was eating into its server costs and capacity. And the automated results of all this AI crawling and scraping can wind up harming both online creators and their former readers. A Nov. 25 Bloomberg story recounted how AI summaries of recipes often leave readers with incorrect instructions while doing enough damage to the traffic of food bloggers that one lamented, "I'm going to have to find something else to do."
Breaking the Fundamental Business Model of the Internet
In July, the internet-services company Cloudflare, which already lets sites using its services (even the free tier) block AI-crawler bots, announced a new "pay per crawl" feature, which lets site owners grant access to AI crawlers from companies that pay for that access. In a panel at the Web Summit conference in Lisbon in November, Cloudflare CEO Matthew Prince called it a badly needed response to an existential threat to the internet we've known. "If these new AI tools aren't generating traffic, then the fundamental business model of the internet is going to break down," Prince told his onstage interviewer, Fortune executive editor Jim Edwards, who replied that Fortune has seen AI do just that: "It's reducing readership, certainly, it's making revenue harder."
Prince, however, said he'd seen a recognition among most AI developers that they can't only take: "When we have conversations with the AI companies, with one very notable exception, they are all saying we have to pay for this content." You can probably guess the exception. Calling this one company both "the great patron of the internet for the last 27 years" and "the great villain of the internet today," Prince said Google makes it impossible for sites to permit its essential web indexing but block its AI crawling using standard robots.txt files, because the same bot does both tasks. "They need to play by the same rules as everyone else and split their crawler so that search and AI are two separate things," he said. Prince then suggested that Google was open to that idea: "I guarantee you that immediately after I get offstage, I will be having this conversation with senior Google executives." Google declined to provide a comment on Prince's talk. The company does allow site owners to block Google from using their content to train its Gemini AI platform, but that does not affect AI Overviews. A separate "nosnippet" option blocks Google from displaying a brief text preview of a page's content but affects both Google's traditional search as well as its AI Overviews.
Cloudflare did not name any AI companies now making payments to site owners via Pay Per Crawl, citing this feature's private-beta status. An executive with a trade group for small online newsrooms couldn't offer any details about member uptake of this option. "I do not know -- and can't get clarity on -- which if any are using the anti-crawling tool," emailed Chris Krewson, executive director of LION Publishers (the abbreviation is short for "local independent online news"). He did note that Cloudflare had tried to sell LION on adopting it, which he took as evidence of limited early adoption.
Another possibility for smaller sites and solo creators could be the Really Simple Licensing standard now backed by a coalition of larger online properties including Reddit, Yahoo, and Ziff Davis, which would let sites post terms for AI use of their content -- and which could work with Cloudflare's AI bot blocking or a similar screen acting as an enforcer. Toward the end of his Web Summit panel, Prince suggested that even AI developers weary of being leapfrogged by rivals should welcome being required to pay for access, because that could let them stand out by buying better content. "What's going to differentiate them?" he asked and then shared his own answer: "Do they have access to original and unique content?"
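For reference, the two Google controls described above, the Gemini training opt-out and the snippet opt-out, come down to a robots.txt entry and a robots meta tag. The sketch below assumes Google's documented Google-Extended user-agent token and the standard nosnippet directive; the file handling and paths are purely illustrative.
```python
# Sketch of the two opt-outs discussed above. "Google-Extended" is Google's
# documented robots.txt token for declining Gemini training use (per the
# article, it does not affect AI Overviews); "nosnippet" is the standard
# robots meta directive that suppresses text previews in both classic search
# results and AI Overviews. The append step is illustrative.
ROBOTS_TXT_OPT_OUT = """\
User-agent: Google-Extended
Disallow: /
"""

NOSNIPPET_META_TAG = '<meta name="robots" content="nosnippet">'

with open("robots.txt", "a", encoding="utf-8") as fh:  # append to an existing robots.txt
    fh.write(ROBOTS_TXT_OPT_OUT)

print(NOSNIPPET_META_TAG)  # would go in each page's <head>
```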
[4]
Really Simple Licensing spec makes AI orgs pay to scrape
Publishers now have more comprehensive tools for managing automated content harvesting.
Most big AI providers scrape the open web, hoovering up content to improve their chatbots, which then compete with publishers for the attention of internet users. However, more AI orgs might have to pay up soon, because the Really Simple Licensing (RSL) spec has reached version 1.0, providing guidance on how to set machine-readable rules for crawlers. "Today's release of RSL 1.0 marks an inflection point for the open internet," said Eckart Walther, chair of the RSL technical steering committee, in a statement. "RSL establishes clarity, transparency, and the foundation for new economic frameworks for publishers and AI systems, ensuring that internet innovation can continue to flourish, underpinned by clear, accountable content rights."
Introduced in September, RSL represents a response to the explosion of automated content harvesting intended to provide fodder for AI model training. It's intended to complement the Robots Exclusion Protocol [RFC 9309], a way for websites to declare acceptable methods of engagement through a robots.txt file. In a bid to prevent their content from being laundered for profit in an AI model, publishers are increasingly trying to negotiate licensing deals or block bot-based data gathering. Website operators typically publish a robots.txt file at the site root to provide guidance to automated traffic. But robots.txt compliance is voluntary and many crawlers ignore the directive.
RSL builds upon syndication spec RSS and the Robots Exclusion Protocol by providing a way to declare requirements for accessing and processing content, which may involve a demand for compensation. The specification includes an XML vocabulary for describing content usage, licensing, and legal terms of service. The RSL document - functionally a machine-readable license - can be integrated with other web mechanisms, including robots.txt, HTTP headers, RSS feeds, and HTML link elements. It provides support for license acquisition and enforcement via the Open License Protocol (OLP), the Crawler Authorization Protocol (CAP), and the Encrypted Media Standard (EMS).
The RSL 1.0 release adds new categories for the <permits> element, such as "ai-all," "ai-input," and "ai-index," to accommodate more specific AI usage rules, such as allowing search engines to index content but not use it for AI search applications. It also includes a new "contribution" payment option for noncommercial organizations that want "a good faith monetary or in-kind contribution that supports the development or maintenance of the assets, or the broader content ecosystem."
While RSL is similar to the Robots Exclusion Protocol in that it's not a technical access control mechanism, it provides support for publishers and partners that choose to implement paywalls and other barriers. There are various technical options to enforce the preferences expressed in RSL and robots.txt declarations for bots that fail to comply, such as network-level barriers. But sometimes legal intervention is required to halt bad behavior. Bad bots may still flout or bypass RSL requirements, but the spec's support for licensing services, encryption mechanisms, and authentication mechanisms should help publishers who choose to challenge such behavior in court.
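To make the <permits> mechanics concrete, here is a minimal, illustrative sketch of an RSL-style machine-readable license generated programmatically. Only the <permits> categories ("ai-all", "ai-index") and the "contribution" payment option are taken from the spec announcement above; the element names, attributes, and overall document shape are assumptions, not the official RSL 1.0 schema.
```python
# Illustrative only: an RSL-style license document built with the standard library.
# The <permits> values and the "contribution" payment type come from the RSL 1.0
# announcement; every other element/attribute name here is an assumption.
import xml.etree.ElementTree as ET

rsl = ET.Element("rsl", {"xmlns": "https://rslstandard.org/rsl"})  # assumed namespace
content = ET.SubElement(rsl, "content", {"url": "https://example.com/"})

# License 1: allow traditional search indexing at no cost.
free_license = ET.SubElement(content, "license")
ET.SubElement(free_license, "permits").text = "ai-index"

# License 2: broader AI use ("ai-all") asks noncommercial users for a
# good-faith "contribution"; commercial terms would be declared separately.
paid_license = ET.SubElement(content, "license")
ET.SubElement(paid_license, "permits").text = "ai-all"
ET.SubElement(paid_license, "payment", {"type": "contribution"})

ET.indent(rsl)  # pretty-print (Python 3.9+)
ET.ElementTree(rsl).write("license.xml", encoding="utf-8", xml_declaration=True)
print(ET.tostring(rsl, encoding="unicode"))
```
In practice such a license file would be referenced from robots.txt, HTTP headers, RSS feeds, or HTML link elements, as described above, so that compliant crawlers can discover the terms before fetching content.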
The RSL spec has been endorsed by infrastructure companies like Cloudflare and Akamai, which offer content tollbooth services for billing AI bots; by publishers like The Associated Press; by social media sites like Stack Overflow; and by micropayment biz Supertab, among others.
"From what we've seen over the last couple of years and the effect that bot scraping has had on these publications, whether that be from the traffic onto their sites, the loss of the advertising revenue on those sites, et cetera, it's time for a new offering that benefits these publications and the content that they provide," Supertab's director of growth, Erick McAfee, told The Register in an interview.
Supertab provides a payment layer for RSL and has been beta testing implementations with about a dozen customers for the past two quarters, although bots aren't actually being billed at this point. McAfee said the testing aims to validate how payments would flow if in fact the bots comply. "The goal is to be able in the future to provide an invoice to these LLMs and explain, 'This is the cause and this is the effect and this is the cost of what's happened.' But as of right now, we're just collecting data to show what's going on," he said.
McAfee said that while he couldn't share information about specific customers, "the data is impressive in the sense that it's definitely impactful" in terms of the impact AI bots have had on site visits and reduced advertising revenue. ®
[5]
Publishers say no to AI scrapers, block bots at server level
A growing number of websites are taking steps to ban AI bot traffic so that their work isn't used as training data and their servers aren't overwhelmed by non-human users. However, some companies are ignoring the bans and scraping anyway. Online traffic analysis conducted by BuiltWith, a web metrics biz, indicates that the number of publishers trying to prevent AI bots from scraping content for use in model training has surged since July. About 5.6 million websites presently have added OpenAI's GPTBot to the disallow list in their robots.txt file, up from about 3.3 million at the start of July 2025. That's an increase of almost 70 percent. Websites can signal to visiting crawlers whether they allow automated requests to harvest information through entries in their robots.txt files. Compliance with these directives is voluntary, but repeated failure to respect these rules may come up in litigation, as it did in Reddit's scraping lawsuit against Anthropic earlier this year. Speaking of Anthropic, the company's ClaudeBot is also increasingly wearing out its welcome. ClaudeBot is now blocked at about 5.8 million websites, up from 3.2 million in early July. The company's Claude-SearchBot - used for surfacing sites in Claude search results - also faces a rising block rate. The situation is similar for AppleBot, now blocked at about 5.8 million websites, up from about 3.2 million in July. Even GoogleBot - which indexes data for search - faces growing resistance, perhaps because it's also used for the AI Overviews now surfaced atop search results. BuiltWith reports that 18 million sites now ban the bot, which would also mean that those sites could not be indexed in Google Search. As of July, about half of news sites blocked GPTBot, according to Arc XP, a publishing platform biz spun out of The Washington Post. Anthropic, OpenAI, and Google did not immediately respond to requests for comment. Anirudh Agarwal, CEO of OutreachX, a web marketing consultancy, said in an emailed statement that it's noteworthy how often GPTBot is getting turned away because that signals how publishers think about AI crawlers. If OpenAI's GPTBot is being blocked, every other AI crawler faces that possibility. Tollbit, a biz that aims to help publishers monetize AI traffic through access fees for crawlers, said in its Q2 2025 report that, in the past year, there's been a 336 percent increase in sites blocking AI crawlers. The company also said that, across all AI bots, 13.26 percent of requests ignored robots.txt directives in Q2 2025, up from 3.3 percent in Q4 2024. This alleged behavior has been challenged in court by Reddit as noted above, and in a lawsuit filed by major news publishers against Perplexity in 2024. But bot blocking efforts have become more complicated because AI firms like OpenAI and Perplexity have launched browsers that incorporate their AI models. According to the Tollbit report, "The latest AI browsers like Perplexity Comet, and devtools like Firecrawl or Browserless are indistinguishable from humans in site logs." So publishers that block Comet or the like might just be blocking human traffic. As a result, Tollbit argues, it's critical that non-human site traffic accurately identifies itself. For organizations that are not major publishers, the AI bot onslaught can be overwhelming. In October, blogging service Bear reported an outage based on AI bot traffic, a problem also noted by Belgium-based blogger Wouter Groeneveld. 
And developer David Gerard, who runs AI-skeptic blog Pivot-to-AI, last month wrote on Mastodon about how RationalWiki.org was having trouble keeping AI bots at bay. Will Allen, VP of product at Cloudflare, told The Register in an interview last month that the company sees "a lot of people that are out there trying to scrape large amounts of data, ignoring any robots.txt directives, and ignoring other attempts to block them." Bot traffic, said Allen, is increasing, which in and of itself isn't necessarily a bad thing. But it does mean, he said, that there are more attacks and more people trying to get around paywalls and content restrictions. Cloudflare, over the summer, launched a service called Pay per crawl in a bid to allow content owners to offer automated access for a price. Allen declined to disclose which sites have signed up to participate in the beta testing but said it's clear that new economic options would be helpful. "We have a thesis or two about how that could evolve," he said. "But really, we think there's going to be a lot of different evolution, a lot of different experimentation. And so we're keeping a pretty tight private beta for our Pay per crawl product just to really learn, from both sides of the market - people who are looking to access content at scale and people who are looking to protect content." ®
A new licensing standard aims to shift the balance of power between publishers and AI companies. Really Simple Licensing 1.0 gives websites control over how AI crawlers access their content, while data shows millions of sites are blocking bots like GPTBot. The move comes as publishers face devastating traffic losses from AI-enhanced search results that keep users from clicking through to original sources.
The Really Simple Licensing (RSL) 1.0 specification has officially launched, giving publishers new tools to enforce licensing and compensation rules for AI crawlers that scrape their content [2]. Backed by the RSL Collective and supported by web infrastructure giants Cloudflare, Akamai, and Fastly, the standard builds on the traditional robots.txt file to create machine-readable licenses that dictate how AI systems can use website content [4]. More than 1,500 media organizations and brands now support RSL, including The Associated Press, Vox Media, The Guardian, Stack Overflow, and Reddit [2].
The specification addresses a critical gap in content licensing by allowing publishers to block their content from AI-powered search features like Google's AI Mode while maintaining a presence in traditional search results [2]. This granular control matters because Google currently doesn't give websites an individual option to opt out of AI Overviews without losing their position in regular search results entirely.
Creative Commons has announced it is "cautiously supportive" of pay-to-crawl systems, marking a significant shift in how content compensation could work on the web [1]. The nonprofit, best known for spearheading open licensing, argues that these systems could help websites sustain content creation while keeping material publicly accessible rather than letting it disappear behind even more restrictive paywalls [1].
Pay-to-crawl technology would charge AI bots every time they scrape a site for AI model training and updates. Cloudflare launched its "Pay per crawl" service over the summer, joining Microsoft, ProRata.ai, and TollBit in building infrastructure for automated compensation [1]. This approach could particularly benefit smaller web publishers that lack the negotiating power to strike individual content deals with AI providers like OpenAI, which has secured agreements with Condé Nast and Axel Springer [1].
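As a rough illustration of how pay-to-crawl could work mechanically, the sketch below shows a price-aware crawler client negotiating access with a hypothetical pay-per-crawl gateway. The flow (an unpaid request answered with HTTP 402 Payment Required, a price advertised in a response header, then a retry carrying proof of payment) and the crawler-price / crawler-payment header names are illustrative assumptions, not any vendor's published interface.
```python
# Conceptual sketch only; the 402-based flow and header names are assumptions,
# not a documented API from Cloudflare, TollBit, or anyone else.
import urllib.request
from urllib.error import HTTPError

def fetch_with_payment(url: str, max_price_usd: float) -> bytes | None:
    """Fetch a page, paying (up to a budget) if the site demands it."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()  # page is free to crawl
    except HTTPError as err:
        if err.code != 402:  # 402 Payment Required
            raise
        price = float(err.headers.get("crawler-price", "inf"))  # hypothetical header
        if price > max_price_usd:
            return None  # too expensive for this crawl budget: skip the page
        # Retry with a (hypothetical) proof-of-payment header attached.
        req = urllib.request.Request(url, headers={"crawler-payment": "token-abc123"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

body = fetch_with_payment("https://publisher.example/article", max_price_usd=0.01)
print("fetched" if body else "skipped")
```
The point of the budget check is that a crawler operator, not the publisher, decides whether a given page is worth the asking price, which is the market dynamic pay-to-crawl proponents describe.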
The number of websites blocking AI crawlers has surged dramatically. About 5.6 million websites now block OpenAI's GPTBot, up from 3.3 million in early July 2025, an increase of almost 70 percent [5]. Anthropic's ClaudeBot faces similar resistance, now blocked at about 5.8 million sites compared to 3.2 million in July [5].
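For context, the "blocking" being counted here is usually just a robots.txt disallow entry. A minimal sketch, using Python's standard-library robots.txt parser and the GPTBot and ClaudeBot user-agent tokens named above; the example robots.txt itself is illustrative.
```python
# What a GPTBot/ClaudeBot disallow list looks like, and how a well-behaved
# crawler would interpret it. Compliance is voluntary: nothing here technically
# stops a crawler that chooses to ignore robots.txt.
from urllib.robotparser import RobotFileParser

EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

for bot in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
    verdict = "allowed" if parser.can_fetch(bot, "https://example.com/article") else "blocked"
    print(f"{bot}: {verdict}")
# Prints: GPTBot: blocked, ClaudeBot: blocked, SomeOtherBot: allowed
```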
This wave of bot blocking reflects growing frustration with unauthorized content scraping and its impact on web traffic. A Pew Research Center study found that AI-enhanced search results reduced clickthrough rates from 15 percent to 8 percent [3]. The shift has devastated publishers by killing search traffic, as consumers get answers from AI chatbots without clicking through to original sources [1].
The legal landscape around AI content usage remains unsettled. Columbia Journalism School's Tow Center tracks 128 content licensing deals between AI operators and news publishers since July 2023, alongside 21 lawsuits alleging copyright infringement [3]. Court rulings have produced mixed results, with one federal judge finding copyright violations when Thomson Reuters' content was leveraged without permission, while another ruled that some AI training met fair-use criteria [3].
Compliance with robots.txt directives remains voluntary, and violations are increasing. According to Tollbit, 13.26 percent of AI bot requests ignored robots.txt rules in Q2 2025, up from 3.3 percent in Q4 2024 [5]. AI crawlers also impose significant bandwidth costs on sites, with Wikipedia warning that automated programs scraping its image catalog were eating into server capacity [3].
While major publishers secure lucrative deals (Meta recently partnered with CNN, Fox News, USA Today, and others for undisclosed compensation [3]), smaller sites struggle to negotiate with tech giants. These publishers face the same exposure to traffic losses from AI-enhanced search results but lack the legal resources to protect their interests [3].
Cloudflare CEO Matthew Prince warned at Web Summit that "the fundamental business model of the internet is going to break down" if AI tools don't generate traffic to original sources [3]. He noted that most AI developers recognize they must pay for content, with one notable exception: Google, which uses the same bot for both search indexing and AI crawling, making it impossible for sites to permit one while blocking the other [3]. Google currently faces a European Commission investigation into whether it has violated antitrust policies by using publishers' content in AI search features without allowing them to refuse [2].
Summarized by Navi