10 Sources
[1]
Pay-per-output? AI firms blindsided by beefed-up robots.txt instructions.
Leading Internet companies and publishers -- including Reddit, Yahoo, Quora, Medium, The Daily Beast, Fastly, and more -- think there may finally be a solution to end AI crawlers hammering websites to scrape content without permission or compensation. Announced Wednesday morning, the "Really Simple Licensing" (RSL) standard evolves robots.txt instructions by adding an automated licensing layer that's designed to block bots that don't fairly compensate creators for content. Free for any publisher to use starting today, the RSL standard is an open, decentralized protocol that makes clear to AI crawlers and agents the terms for licensing, usage, and compensation of any content used to train AI, a press release noted. The standard was created by the RSL Collective, which was founded by Doug Leeds, former CEO of Ask.com, and Eckart Walther, a former Yahoo vice president of products and co-creator of the RSS standard, which made it easy to syndicate content across the web. Based on the "Really Simple Syndication" (RSS) standard, RSL terms can be applied to protect any digital content, including webpages, books, videos, and datasets. The new standard supports "a range of licensing, usage, and royalty models, including free, attribution, subscription, pay-per-crawl (publishers get compensated every time an AI application crawls their content), and pay-per-inference (publishers get compensated every time an AI application uses their content to generate a response)," the press release said. Leeds told Ars that the idea to use the RSS "playbook" to roll out the RSL standard arose after he invited Walther to speak to University of California, Berkeley students at the end of last year. That's when the longtime friends, both with search backgrounds, began pondering how AI had changed the search industry, as publishers today are forced to compete with AI outputs referencing their own content while search traffic nosedives.
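To make the difference between the royalty models concrete, here is a minimal sketch in Python. The class and function names are hypothetical illustrations, not part of the RSL spec; only the model names (free, attribution, subscription, pay-per-crawl, pay-per-inference) come from the press release quoted above.

```python
from dataclasses import dataclass

@dataclass
class LicenseTerms:
    # Model names taken from the RSL press release; the rest is illustrative.
    model: str            # "free", "attribution", "subscription",
                          # "pay-per-crawl", or "pay-per-inference"
    fee_usd: float = 0.0  # fee per billable event, if any

def royalty_due(terms: LicenseTerms, crawls: int, inferences: int) -> float:
    """Compute what a compliant AI crawler would owe under each model."""
    if terms.model == "pay-per-crawl":
        return terms.fee_usd * crawls        # billed per fetch of the content
    if terms.model == "pay-per-inference":
        return terms.fee_usd * inferences    # billed per answer using it
    return 0.0  # free/attribution/subscription are settled out of band

# A page crawled 10 times but referenced in 5,000 answers shows why a
# publisher might prefer pay-per-inference over pay-per-crawl.
per_crawl = royalty_due(LicenseTerms("pay-per-crawl", 0.01), crawls=10, inferences=5000)
per_inference = royalty_due(LicenseTerms("pay-per-inference", 0.01), crawls=10, inferences=5000)
```

Under these assumed numbers, the same content earns orders of magnitude more per-inference than per-crawl, which is the economic point the standard's backers are making.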
[2]
RSS co-creator launches new protocol for AI data licensing | TechCrunch
In the wake of Anthropic's $1.5 billion copyright settlement, the AI industry is coming to terms with its training data problem. There are as many as 40 other pending cases that seek damages for unlicensed data -- including one that takes Midjourney to court for creating images of Superman. Without some kind of licensing system, AI companies could face an avalanche of copyright lawsuits that some worry will set the industry back permanently. Now, a group of technologists and web publishers has launched a system that would enable data licensing at massive scale -- provided AI companies take them up on it. Called Really Simple Licensing (RSL), the system is already backed by major web publishers like Reddit, Quora, and Yahoo. The question now is whether that momentum will be enough to bring major AI labs to the bargaining table. According to RSL co-founder Eckart Walther, who also co-created the RSS standard, the goal was to create a training-data licensing system that could scale across the internet. "We need to have machine-readable licensing agreements for the internet," Walther told TechCrunch. "That's really what RSL solves." For years, groups like the Dataset Providers Alliance have been pushing for clearer collection practices, but RSL is the first attempt at a technical and legal infrastructure that could make licensing work in practice. On the technical side, the RSL protocol lays out specific licensing terms a publisher can set for their content, whether that means requiring AI companies to obtain a custom license or adopting Creative Commons provisions. Participating websites include the terms as part of their "robots.txt" file in a prearranged format, making it straightforward to identify which data falls under which terms. On the legal side, the RSL team has established a collective licensing organization, the RSL Collective, that can negotiate terms and collect royalties, similar to ASCAP for musicians or MPLC for films.
As in music and film, the goal is to give licensors a single point of contact for paying royalties, and to provide rightsholders a way to set terms with dozens of potential licensors at once. A host of web publishers have already joined the collective, including Yahoo, Reddit, Medium, O'Reilly Media, Ziff Davis (owner of Mashable and CNET), Internet Brands (owner of WebMD), People Inc. and The Daily Beast. Others, like Fastly, Quora, and Adweek, are supporting the standard without joining the collective. Notably, the RSL Collective includes some publishers that already have licensing deals -- most notably Reddit, which receives an estimated $60 million a year from Google for use of its training data. There's nothing stopping companies from cutting their own deals within the RSL system, just as Taylor Swift can set special terms for licensing while still collecting royalties through ASCAP. But for publishers too small to strike their own deals, RSL's collective terms are likely to be the only option. And while it's easy enough to determine when a song has been played, AI models pose unique challenges when it comes to figuring out when royalties are due for a specific piece of training data. The issue is simplest for a product like Google's AI Overviews, which draws data from the web in real time and maintains strict attribution for each fact. But if training isn't logged when it occurs, it can be nearly impossible to confirm that a given document was ingested into an LLM. It's particularly challenging if publishers ask to be paid per inference rather than receiving a blanket fee, an option offered by one of the stock RSL licenses. Still, RSL's creators believe AI companies will be able to manage the difficulty. "Some of the licensing agreements they've already done have required them to be able to report on it, so it's possible," says Doug Leeds, a co-founder of RSL and former CEO of IAC Publishing. "It doesn't have to be perfect.
It just has to be good enough to get people paid." The bigger question is whether AI companies will embrace the system. As the success of companies like Scale AI and Mercor shows, frontier labs have no problem paying for data, but the web has traditionally been seen as a source of cheap, low-quality data. With datasets like Common Crawl already available, it may be a challenge to extract royalties for something labs are used to getting for free. And as the recent dustup between Cloudflare and Perplexity shows, it's not straightforward to tell the difference between web-scraping and machine-enhanced browsing. When I put the question to Leeds, he pointed to recent comments from AI leaders calling for a system like RSL -- most notably from Sundar Pichai at last year's Dealbook Summit. Whether the calls for a licensing system are earnest or not, the RSL team plans to hold them to it. "They have said outwardly to everyone, something like this needs to exist," Leeds told me. "We need a protocol. We need a system."
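Mechanically, the robots.txt integration described above is lightweight: the publisher's machine-readable terms live in a separate licensing document that robots.txt points to. The sketch below is illustrative only; the directive name, element names, and URLs are assumptions based on the articles' descriptions, and the authoritative syntax is whatever the published RSL specification defines.

```
# robots.txt -- adds a pointer to machine-readable licensing terms
User-Agent: *
License: https://example.com/license.xml
Allow: /
```

The licensing document itself would then spell out the terms in a structured format, along these schematic lines:

```
<!-- license.xml: schematic RSL-style terms; element names are illustrative -->
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/articles/">
    <license>
      <payment type="per-inference" amount="0.01" currency="USD"/>
    </license>
  </content>
</rsl>
```

Keeping the terms in a linked document rather than in robots.txt itself means one set of terms can cover many pages, and crawlers can cache it.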
[3]
Online Media Brands Hope a New Protocol Will Stop Unwanted AI Crawlers
Online media brands, including Yahoo, Quora and Medium, are taking a new step to prevent AI companies from copying and using their content to train models without their permission. The publishers, including CNET's parent company Ziff Davis, see this new tool, called RSL, as another way to ensure large AI developers don't use their work without payment or compensation -- an issue that's already led to a host of lawsuits. RSL, which stands for Really Simple Licensing, is inspired by Really Simple Syndication, a longtime web standard that provides up-to-date and automatic content updates in a computer-readable format. Like RSS, RSL is open, decentralized and can work with pretty much any piece of content online, including web pages, videos and datasets. Right now, when an AI company's roving internet robot, known as a crawler, wants to suck up the information on a site, it has to go through robots.txt, which acts as a basic entry or non-entry door. AI companies have found ways around robots.txt or ignored it altogether and have subsequently been sued. The goal for RSL is to be a more robust layer of tech to deal with AI crawlers, which now account for more than half of all internet traffic. (Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.) "RSL builds directly on the legacy of RSS, providing the missing licensing layer for the AI-first Internet," Tim O'Reilly, CEO of O'Reilly Media, said in a press release. "It ensures that the creators and publishers who fuel AI innovation are not just part of the conversation but fairly compensated for the value they create."
Brands that have signed onto RSL include Reddit, People, Internet Brands, Fastly, wikiHow, O'Reilly, Daily Beast, The MIT Press, Miso, Adweek, Ranker, Evolve Media and Raptive. "If AI is trained on our writers' work, then it needs to pay for that work," Medium CEO Tony Stubblebine said in a press release. "Right now, AI runs on stolen content. Adopting this RSL Standard is how we force those AI companies to either pay for what they use, stop using it, or shut down." The advent of RSL comes as online web traffic has cratered with changes to Google and the preponderance of AI. Google's integrated AI-generated answers at the top of Google Search have been criticized by publishers as taking away from potential clicks they would have received otherwise. Google contends that AI Overviews send "higher quality clicks" to sites, people who are more engaged and stay on sites longer. AI chatbots like ChatGPT also help with research and synthesis, meaning people don't have to jump around various sites to pull together pieces of information in the same way they did before. Overall, publishers are losing up to 25% of traffic due to AI platforms, according to a report from Infactory. "Widespread adoption of the RSL Standard will protect the integrity of original work and accelerate a mutually beneficial framework for publishers and AI providers," Ziff Davis CEO Vivek Shah said. In response, publishers are suing AI companies or inking licensing deals. In other instances, sites are turning to services like Tollbit, which aim to charge AI crawlers every time they ask to examine a site's contents. Content delivery networks like Cloudflare, which help ensure people have quick access to sites online, are blocking AI crawlers outright. RSL co-founder Eckart Walther said the RSL standard and efforts like that by Cloudflare are complementary, with many of the same media companies participating in both. 
Walther compared the tools like Cloudflare to bouncers that protect a website from unwanted crawlers, while RSL just allows the crawler to understand the rules and the price of admission. "These compensation methods can also work together. For example, a publisher might want to charge for crawling their content, and then also require a royalty payment every time the content is used by an AI model to reply to a question," Walther said in an email to CNET.
[4]
AI's free web scraping days may be over, thanks to this new licensing protocol
Media companies announced a new web protocol: RSL. RSL aims to put publishers back in the driver's seat. The RSL Collective will attempt to set pricing for content. AI companies are capturing as much content as possible from websites while also extracting information. Now, several heavyweight publishers and tech companies -- Reddit, Yahoo, People, O'Reilly Media, Medium, and Ziff Davis (ZDNET's parent company) -- have developed a response: the Really Simple Licensing (RSL) standard. You can think of RSL as Really Simple Syndication's (RSS) younger, tougher brother. While RSS is about syndication, getting your words, stories, and videos out onto the wider web, RSL says: "If you're an AI crawler gobbling up my content, you don't just get to eat for free anymore." The idea behind RSL is brutally simple. Instead of the old robots.txt file -- which only said, "yes, you can crawl me," or "no, you can't," and which AI companies often ignore -- publishers can now add something new: machine-readable licensing terms. Want attribution? You can demand it. Want payment every time an AI crawler ingests your work, or even every time it spits out an answer powered by your article? Yep, there's a tag for that too. This approach allows publishers to define whether their content is free to crawl, requires a subscription, or will cost "per inference," that is, every time ChatGPT, Gemini, or any other model uses content to generate a reply. It's a clever fix for a complex problem. As Tim O'Reilly, the O'Reilly Media CEO and one of the RSL initiative's high-profile backers, said: "RSS was critical to the internet's evolution...but today, as AI systems absorb and repurpose that same content without permission or compensation, the rules need to evolve. RSL is that evolution." O'Reilly's right. RSS helped the early web scale, whether blogs, news syndication, or podcasts.
But today's web isn't just competing for human eyeballs. The web is now competing to supply the training and reasoning fuel for AI models that, so far, aren't exactly paying the bills for the sites they're built on. Of course, tech is one thing; business is another. That's where the RSL Collective comes in. Modeled on music's ASCAP and BMI, the nonprofit is essentially a rights-management clearinghouse for publishers and creators. Join for free, pool your rights, and let the Collective negotiate with AI companies to ensure you're compensated. As anyone in publishing knows, a lone freelancer, or most media outlets for that matter, has about as much leverage against the likes of OpenAI or Google as a soap bubble in a wind tunnel. But a collective that represents "the millions" of online creators suddenly has some bargaining power. (Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.) Let's step back. For the last few years, AI has been snacking on the internet's content buffet with zero cover charge. That approach worked when the web's economics were primarily driven by advertising. However, those days are history. The old web ad model has left publishers gutted while generative AI companies raise billions in funding. So, RSL wants to bolt a licensing framework directly into the web's plumbing. And because RSL is an open protocol, just like RSS, anyone can use it. From a giant outlet like Yahoo to a niche recipe blogger, RSL allows web publishers to spell out what they want in return when AI comes crawling.
The work of guiding RSL falls to the RSL Technical Steering Committee, which reads like a who's who of the web's protocol architects: Eckart Walther, co-author of RSS; RV Guha, Schema.org and RSS; Tim O'Reilly; Stephane Koenig, Yahoo; and Simon Wistow, Fastly. The web has always run on invisible standards such as HTTP, HTML, RSS, and robots.txt. In Web 1.0, social contracts were written into code. If RSL catches on, it may be the next layer in that lineage: the one that finally gives human creators a fighting chance in the AI economy. And maybe, just maybe, RSL will stop the AI feast from becoming an all-you-can-eat buffet with no one left to cook.
[5]
Reddit, Ziff Davis Back New Idea to Stop AI From Ruining the Internet
It's safe to say AI is reinventing the internet. But at the same time, it threatens the foundation upon which it stands: information. Chatbots consume and regurgitate information from across the web, but they lack a standardized business model to compensate sources. That means those sources could one day dry up, leaving less information for the always-hungry AI and weakening its output. Enter the Really Simple Licensing (RSL) Standard, a new tech-based licensing solution for the "AI-first internet," as RSL puts it. It's backed by Reddit, Yahoo, Ziff Davis (PCMag's parent company), People, Medium, wikiHow, Quora, Adweek, and more. The RSL Standard would allow websites and individual creators to set terms for using their content -- from written work to videos, web pages, images, and datasets -- before ChatGPT, Claude, Google, or any other AI system surfaces it in chatbot responses. Anyone can sign up for free by joining the RSL Collective, a website where content creators can set their terms and, ideally, see the money flow in. "Today, there's really no way for a website to say, 'Hey, I want Google AI Overviews off [for my content] unless you can compensate me for that lost revenue,' which is not unreasonable," says Eckart Walther, co-founder of the RSL Standard and one of the original creators of RSS (Really Simple Syndication), which RSL is based on. Doug Leeds, former CEO of IAC Publishing and Ask.com, is the other co-founder. No AI companies have agreed to honor RSL yet, which Walther and Leeds are working on. But AI companies are "asking for it," Leeds says. "They're saying there needs to be a better licensing structure." The more websites and creators that sign up for the RSL Collective, the louder the message to AI companies that it's the right solution, they say.
OpenAI has crafted its own licensing structure, but it's based on a flat fee rather than paying the content owner every time its original work is used. Reddit, for example, has struck a $60 million AI licensing deal with Google, and another with OpenAI for an undisclosed amount. Given that context, it's significant that even Reddit backs the RSL Standard. One reason is that a flat fee does not reflect how often AI systems use its content, so it could leave money on the table; Reddit can only hope such a deal reflects fair value, and renegotiate if it doesn't. "The RSL Standard gives publishers and platforms a clear, scalable way to set licensing terms in the AI era," says Steve Huffman, CEO of Reddit. "The RSL Collective offers a path to do it together. Reddit supports both as important steps toward protecting the open web and the communities that make it thrive." Ziff Davis has not yet struck a licensing deal with OpenAI and is suing the company for ignoring its instructions not to crawl its content until it has one. The company claims OpenAI's web crawlers are ignoring the "robots.txt" file in the backend of its sites, which says they are not allowed to crawl it. Robots.txt has been around "since the old days," Walther says. It's clearly not holding up today. "Widespread adoption of the RSL Standard will protect the integrity of original work and accelerate a mutually beneficial framework for publishers and AI providers," says Vivek Shah, CEO of Ziff Davis. RSL is designed to work with a similar system from Cloudflare that debuted in July. Leeds' analogy for how the two fit together is to imagine Cloudflare as "a bouncer" that decides whether you can get in or not. RSL adds another layer where publishers can set their terms of entry: it's like the ID the bouncer checks, making sure the patron, an AI in this case, meets those terms. "The one thing that really distinguishes us is that we're doing this as a nonprofit," Leeds says.
"We come from search, we come from media. We know the problems out there, and they're not how to make some investor billions of dollars. They're how to compensate people who are doing the work, and that's what we're about."
[6]
New RSL spec wants AI crawlers to show a license or pay
Content creation and delivery companies have introduced a digital licensing mechanism in an effort to compensate media makers when AI companies use their work. The Really Simple Licensing (RSL) standard is intended to provide websites with a programmatic way to present web crawlers with licensing terms, and to gate site access based on license compliance, which may require payment. RSL attempts to improve upon robots.txt, a voluntary way for websites to declare how bots should interact with site content. Crawlers frequently ignore robots.txt, and some disguise themselves to avoid being blocked. So RSL offers a compliance mechanism that requires bots to present an authorization header as part of the network negotiation process. "With RSL, websites can enforce stricter control over their content usage by blocking crawlers that have not obtained a free or paid license from an RSL License Server," the documentation explains. "When a crawler requests a page that is managed by an RSL license from your website, it must include a valid RSL License Token for the page in the Authorization header using the new proposed License authentication scheme defined in RFC 7235 HTTP Authentication." RSL is based on Really Simple Syndication (RSS), a popular decentralized web protocol. "RSL builds directly on the legacy of RSS, providing the missing licensing layer for the AI-first Internet," said Tim O'Reilly, CEO of O'Reilly Media, one of the organizations steering the project, in a statement. "It ensures that the creators and publishers who fuel AI innovation are not just part of the conversation but fairly compensated for the value they create." RSL is administered by a newly formed nonprofit, the RSL Collective, a rights collective similar to ASCAP and BMI in the music industry. Along with O'Reilly Media, RSL and the RSL Collective are backed by Reddit, People Inc., Yahoo, Internet Brands, Ziff Davis, wikiHow, Medium, The Daily Beast, Miso.AI, Raptive, Ranker, and Evolve Media. 
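The enforcement mechanism quoted above is ordinary HTTP authentication: a crawler that has obtained a token from an RSL License Server presents it in the Authorization header using the proposed "License" scheme. Here is a minimal client-side sketch using Python's standard library; the URL and token value are placeholders, and the token's actual format is defined by the RSL documentation, not shown here.

```python
import urllib.request

def build_licensed_request(url: str, license_token: str) -> urllib.request.Request:
    """Attach an RSL License Token using the proposed 'License' auth scheme
    (an RFC 7235-style Authorization header, per the RSL documentation)."""
    req = urllib.request.Request(url)
    # The token itself would be obtained from an RSL License Server,
    # not hard-coded; "example-token" below is purely illustrative.
    req.add_header("Authorization", f"License {license_token}")
    return req

req = build_licensed_request("https://example.com/article", "example-token")
```

A crawler without this header (or with an invalid token) would simply be refused by sites enforcing RSL terms.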
Fastly, Quora, and Adweek have endorsed the RSL standard but are not participating in the RSL Collective. The technical aspects of the RSL Standard are overseen by the RSL Technical Steering Committee, staffed by representatives from participating publishing and technology companies. AI vendors like Anthropic, Google, Microsoft, OpenAI, and others rely on web crawling scripts or bots to ingest website content, which then gets used for training AI models or other applications like search. Initially, this was done without permission, sometimes using unauthorized datasets, sparking copyright disputes, or based on assumptions about fair use that continue to be litigated. But following the debut of ChatGPT in November 2022 and the ensuing expectation that search will be subsumed by AI, content owners have become more concerned that their commercial assets are being co-opted without compensation. Large publishers like the New York Times and social media sites like Reddit have negotiated specific deals that provide AI vendors with access to their content, and have sued in court when model makers have ignored their wishes. Amid reports that AI-based search services like Google's AI Overviews are starving websites of visitor traffic - a claim Google has denied even as it acknowledges the decline of the web in court - efforts to obtain compensation from AI vendors have led to web tollbooth services like Cloudflare's Pay per crawl. "RSL complements the Cloudflare announcement by enabling publishers to define rules of how crawlers legally license (including paying per crawl) their content and get unblocked," RSL co-founder Eckart Walther told The Register in an email. "Most publishers that support RSL also participated in the Cloudflare AI blocking and Pay-per-Crawl announcement, and both approaches can work together to ensure that publishers can assert their content rights and receive fair compensation from AI companies." 
"Fair compensation" in this instance will be determined by RSL Collective members. "The RSL Collective is a nonprofit Collective Rights Organization that is guided by the priorities of its members - it negotiates on behalf of its members, but the definition of 'fair' will be determined by an open, transparent process that needs to be approved by members," explained Walther. "As a nonprofit, the RSL Collective passes all royalties to its members, minus any costs needed to operate the service." RSL, according to its creators, supports a variety of licensing, usage, and royalty models, including free, attribution, subscription, pay-per-crawl, and pay-per-inference. Since RSL is simply a standard that defines the criteria for admission, the challenge of separating human visitors from bots and forcing bots to comply with terms has been left to internet service providers, particularly content delivery networks like Fastly, Akamai, or Cloudflare, and custom crawler detection solutions. "While there are certainly bad actors that will not follow the rules set by publishers and might need to be blocked by solutions developed by companies like Fastly and Cloudflare, we believe that the majority of large AI firms understand, and have publicly spoken about, the need for fair compensation for the publishing industry," said Walther. "The nonprofit RSL Collective exists to make this process dramatically simpler for AI firms by pooling licensing rights for millions of publishers and creators, the same way that large music distributors are able to license large music catalogs through music collective rights organizations."
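On the publisher side, the gating described here can be sketched as a simple check-and-challenge: requests lacking a valid token are refused with an RFC 7235-style 401 response naming the License scheme. This toy version stubs out token validation; in practice that step would be delegated to an RSL License Server or a CDN such as Fastly or Cloudflare, and the token set shown is purely illustrative.

```python
from http import HTTPStatus

# Stand-in for real verification against an RSL License Server.
VALID_TOKENS = {"example-token"}

def gate_request(headers: dict) -> tuple:
    """Return (status code, extra response headers) for an incoming request."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme == "License" and token in VALID_TOKENS:
        return HTTPStatus.OK, {}
    # Per RFC 7235, a 401 must carry a WWW-Authenticate challenge
    # naming the expected authentication scheme.
    return HTTPStatus.UNAUTHORIZED, {"WWW-Authenticate": "License"}
```

The design mirrors the division of labor Walther describes: the standard only defines the admission criteria, while distinguishing humans from bots and actually blocking traffic is left to the infrastructure layer.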
[7]
Publishers are fighting back against AI with a new web protocol - is it too late?
[8]
Reddit, Yahoo, Medium and more are adopting a new licensing standard to get compensated for AI scraping
With web publishers in crisis, a new open standard lets them set the ground rules for AI scrapers. (Or, at least it will try.) The new Really Simple Licensing (RSL) standard creates terms that participants expect AI companies to abide by. Although enforcement is an open question, it can't hurt that some heavy hitters back it. Among others, the list includes Reddit, Yahoo (Engadget's parent company), Medium and People Inc. RSL adds licensing terms to the robots.txt protocol, the simple file that provides instructions for web crawlers. Supported licensing options include free, attribution, subscription, pay-per-crawl and pay-per-inference. (The latter means AI companies only pay publishers when the content is used to generate a response.) Launching alongside the standard is a new managing nonprofit, the RSL Collective. It views itself as an equivalent of nonprofits like ASCAP and BMI, which manage music industry royalties. The new group says its standard can "establish fair market prices and strengthen negotiation leverage for all publishers." Participating brands include plenty of internet old-schoolers. Reddit, People Inc., Yahoo, Internet Brands, Ziff Davis, wikiHow, O'Reilly Media, Medium, The Daily Beast, Miso.AI, Raptive, Ranker and Evolve Media are all on board. Former Ask.com CEO Doug Leeds and RSS co-creator Eckart Walther lead the group. "The RSL Standard gives publishers and platforms a clear, scalable way to set licensing terms in the AI era," Reddit CEO Steve Huffman wrote in a press release. "The RSL Collective offers a path to do it together. Reddit supports both as important steps toward protecting the open web and the communities that make it thrive." (It's worth noting that Reddit has licensing deals with OpenAI and Google.) It's unclear whether AI companies will honor the standard. After all, they've been known to simply ignore robots.txt instructions. But the group believes its terms will be legally enforceable. 
In an interview with Ars Technica, Leeds pointed to Anthropic's recent $1.5 billion settlement, suggesting "there's real money at stake" for AI companies that don't train "legitimately." (However, that settlement is up in the air after a judge rejected it.) Leeds told The Verge that the standard's collective nature could also help spread legal costs, making challenges to violations more feasible.

As for technical enforcement, the RSL standard can't block bots on its own. For that, the group is partnering with the cloud company Fastly, which can act as a sort of gatekeeper. (Perhaps Cloudflare, which recently launched a pay-per-crawl system, could eventually play a part, too.) Leeds said Fastly could serve as "the bouncer at the door to the club."

Leeds suggested to Ars that there are incentives for AI companies, too. Financially, it could be simpler for them than inking individual licensing deals. It could also solve a quality problem in AI-generated content: today, models often stitch an answer together from multiple sources to avoid drawing too heavily on any one of them. If content is legally licensed, the AI app can simply use the best source, which provides the user with a higher-quality answer and minimizes the risk of hallucinations.

He also referenced complaints from AI companies that there's no effective means of licensing web-wide content. "We have listened to them, and what we've heard them say is... we need a new protocol," Leeds told Ars Technica. With the RSL standard, AI firms get a "scalable way to get all the content" they want, while setting an incentive that they'll only have to pay for the best content that their models actually reference. "If they're using it, they pay for it, and if they're not using it, they don't pay for it."
[9]
The AI-Scraping Free-for-All Is Coming to an End
You can divide the recent history of LLM data scraping into a few phases. For years there was an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts. Once apps like ChatGPT became popular and companies started commercializing models, the matter of training data became instantly and extremely contentious. Authors, filmmakers, musicians, and major publishers and internet companies started calling out AI firms and filing lawsuits. OpenAI started making individual deals with publishers and platforms -- including Reddit and New York magazine's parent company, Vox Media -- to ensure ongoing access to data for training and up-to-date chat content, while other companies, including Google and Amazon, entered into licensing deals of their own.

Despite these deals and legal battles, however, AI scraping became only more widespread and brazen, leaving the rest of the web to wonder what, exactly, is supposed to happen next. Publishers are up against sophisticated actors. Lavishly funded start-ups and tech megafirms are looking for high-quality data wherever they can find it, offline and on, and web scraping has turned into an arms race. There are scrapers masquerading as search engines or regular users, and blocked companies are building undercover crawlers.

Website operators, accustomed to having at least nominal control over whether search engines index their content, are seeing the same thing in their data: swarms of voracious machines making constant attempts to harvest their content, spamming them with billions of requests. Internet infrastructure providers are saying the same thing: AI crawlers are going for broke.
A leaked list of sites allegedly scraped by Meta, obtained by Drop Site News, includes "copyrighted content, pirated content, and adult videos, some of whose content is potentially illegally obtained or recorded, as well as news and original content from prominent outlets and content publishers." This is neither surprising nor unique to one company. It's closer to industry-standard practice.

For decades, the most obvious reason to crawl the web was to build a useful index or, later, a search engine like Google. A Google crawl meant you had a chance to show up in search results and actual people might visit your website. AI crawlers offer a different proposition. They come, they crawl, and they copy. Then they use that copied data to build products that in many cases compete with their sources (see: Wikipedia or any news site) and at most offer in return footnoted links few people will follow (see: ChatGPT Search and Google's AI Mode). For an online-publishing ecosystem already teetering on the edge of collapse, such an arrangement looks profoundly grim. AI firms scraped the web to build models that will continue to scrape the web until there's nothing left.

In June, Cloudflare, an internet infrastructure firm that handles a significant portion of online traffic, announced a set of tools for tracking AI scraping and plans to build a "marketplace" that would allow sites to set prices for "accessing and taking their content to ingest into these systems." This week, a group of online organizations and websites -- including Reddit, Medium, Quora, and Cloudflare competitor Fastly -- announced the RSL standard, short for Really Simple Licensing (a reference to RSS, or Really Simple Syndication, some co-creators of which are involved in the effort).
The idea is simple: With search engines, publishers could indicate whether they wanted to be indexed, and major search engines usually obliged; now, under more antagonistic circumstances, anyone who hosts content will be able to indicate not just whether the content can be scraped but how it should be attributed and, crucially, how much they want to charge for its use, either individually or as part of a coordinated group.

As far as getting major AI firms to pay up, not to mention the hundreds of smaller firms that are also scraping, RSL is clearly an aspirational effort, and I doubt the first step here is for Meta or OpenAI to instantly cave and start paying royalties to WebMD. Combined with the ability to use services like Cloudflare and Fastly to more effectively block AI firms, though, it does mark the beginning of a potentially major change.

For most websites, AI crawling has so far been a net negative, and there isn't much to lose by shutting it down (with the exception of Google, which crawls for its Search and AI products using the same tools). Now, with the backing of internet infrastructure firms that can actually keep pace with big tech's scraping tactics, they can. (Tech giants haven't been above scraping one another's content, but they're far better equipped to stop it if they want to.)

A world in which a majority of public websites become invisible to AI firms by default is a world in which firms that have depended on relatively unfettered access to the web could start hurting for up-to-date information, be it breaking news, fresh research, new products, or just ambient culture and memes. They may not be inclined to pay everyone, but they may eventually be forced to pay someone, through RSL or otherwise.
[10]
Reddit to Yahoo: Why RSL AI license is getting traction with online publishers
The battle over who controls online content - and who gets paid for it - is heating up. For years, AI companies have scraped the web to feed their models, often without permission or compensation. Publishers, meanwhile, have watched their work fuel billion-dollar technologies while struggling to protect their own revenues. Now, a new licensing framework called Really Simple Licensing (RSL) is stepping into that gap. Backed by platforms from Reddit to Yahoo, the standard aims to give publishers more control and force AI developers to pay for what they use.

The Really Simple Licensing (RSL) standard is an attempt to reshape how online content is accessed and used in the age of artificial intelligence. Inspired by the long-standing robots.txt protocol, which tells web crawlers which parts of a site they may or may not access, RSL goes much further. It lets publishers specify not only whether AI systems can crawl their sites, but also the terms under which that content can be used. Crucially, it introduces the possibility of licensing fees and royalties when AI models train on or generate outputs from a publisher's material. In other words, it's robots.txt with teeth, a protocol designed to create accountability and payment mechanisms where previously there were none.

At its core, RSL embeds licensing information directly into a site's metadata, making the rules machine-readable. AI developers that want to crawl a site must check those instructions and abide by them, whether that means paying a subscription fee, agreeing to a pay-per-crawl model, or even paying per inference when content is used to generate an answer. The RSL Collective, a nonprofit body overseeing the standard, also envisions publishers banding together to negotiate terms more effectively, much like licensing groups in the music industry.
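As a rough illustration of the "machine-readable rules" described above, a publisher's terms might look something like the following. The element and attribute names here are invented for this sketch, following RSL's RSS-style XML lineage; the actual RSL schema defines its own vocabulary, so treat this as a shape, not the spec.

```xml
<!-- Hypothetical sketch of machine-readable licensing terms.
     Element and attribute names are illustrative, not the published RSL schema. -->
<license>
  <content url="https://example.com/articles/"/>
  <permits usage="ai-training"/>
  <payment model="per-inference" currency="USD" amount="0.0001"/>
  <attribution required="true"/>
</license>
```

The point of a machine-readable format like this is that a crawler can fetch it, evaluate the terms automatically, and either agree (and report usage for payment) or skip the site, with no one-off negotiation needed.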
Technical enforcement is expected to be aided by infrastructure providers such as Fastly, which can block or allow access depending on whether bots have properly identified themselves and agreed to the site's licensing rules.

The early support for RSL includes some of the internet's biggest content platforms and publishers. Reddit, Yahoo, and Medium have signed on, along with Quora, wikiHow, O'Reilly Media, and Ziff Davis, which owns sites like CNET and Mashable. This backing is significant because it brings together a wide spectrum of content sources: user-generated forums, knowledge platforms, professional publishers, and digital media outlets. Their participation signals a recognition that without collective action, publishers risk seeing their work continue to fuel AI systems without compensation or consent.

For online publishers, the RSL standard represents both an opportunity and a safeguard. It offers the promise of new revenue streams by charging AI companies for access and use, something that could be especially valuable at a time when traditional digital advertising revenue is under strain. It also gives publishers more granular control, letting them decide under what circumstances their content can be used and setting boundaries around AI training and inference. At the same time, RSL provides smaller publishers with a way to stand alongside larger players by joining the Collective, reducing the need for one-on-one negotiations with powerful AI companies. The challenge, however, lies in enforcement: if AI developers simply ignore the standard, as some have done with robots.txt in the past, publishers will be forced to rely on legal and technical measures to defend their rights.

For AI developers, the rise of RSL introduces both complexity and accountability.
Where once the web could be treated as a nearly free and open dataset, now developers must contend with licensing terms that vary across sites. They may need to pay fees, report usage, and develop tools to ensure they are compliant with pay-per-inference models. This could make training large language models more expensive and technically challenging. On the other hand, it also provides a clearer framework for lawful and ethical use of data, reducing the risk of lawsuits or public backlash. If widely adopted, RSL could offer AI developers something they currently lack: a standard way to license data at scale.

The timing of RSL's emergence is no accident. Over the past year, lawsuits and public debates have mounted over whether AI companies are unfairly profiting from creative work they did not produce. Publishers and creators are increasingly demanding compensation, while regulators are beginning to scrutinize AI training practices. RSL arrives as both a practical technical solution and a political statement: it allows publishers to say, "You may use our content, but only on our terms." The fact that major platforms like Reddit and Yahoo are backing it gives the standard legitimacy, and the involvement of infrastructure players like Fastly suggests a path toward real enforcement.

Whether RSL becomes a cornerstone of AI licensing or fades into the background will depend largely on adoption. If AI giants like OpenAI, Anthropic, and Google choose to recognize and comply with the standard, it could reshape the economics of web data almost overnight. If they ignore it, RSL may struggle to gain traction without significant legal reinforcement. What is clear, however, is that the balance of power between AI companies and online publishers is shifting. For the first time, there is a unified technical standard designed to make AI companies pay for the content they rely on.
Major internet companies and publishers introduce RSL, a new standard designed to regulate AI's use of web content. This protocol aims to ensure fair compensation for content creators and publishers in the AI era.
In a groundbreaking move, leading internet companies and publishers have introduced the 'Really Simple Licensing' (RSL) standard, a new protocol designed to revolutionize how AI companies access and use web content for training their models [1][2]. This initiative comes in response to the growing concerns about AI companies using content without permission or compensation, which has led to numerous lawsuits and a potential crisis in the AI industry [2].

RSL, inspired by the 'Really Simple Syndication' (RSS) standard, is an open, decentralized protocol that allows publishers to set clear terms for licensing, usage, and compensation of their content [1][3]. It works by adding a licensing layer to the existing robots.txt file, enabling publishers to specify conditions such as attribution requirements, subscription models, pay-per-crawl, or pay-per-inference arrangements [1][4].

The RSL standard has garnered support from major web publishers and tech companies, including Reddit, Yahoo, Quora, Medium, The Daily Beast, Fastly, and Ziff Davis [1][2][5]. The initiative was co-founded by Doug Leeds, former CEO of Ask.com, and Eckart Walther, a former Yahoo vice president and co-creator of the RSS standard [1].

To facilitate negotiations and royalty collection, the RSL team has established the RSL Collective, a nonprofit organization modeled after ASCAP for musicians [2][5]. This collective aims to provide a unified platform for content creators to set terms and potentially receive compensation for their work used in AI training [2].

While RSL offers a promising solution, it faces challenges in implementation. Determining when royalties are due for specific pieces of training data can be complex, especially for large language models [2]. Additionally, the success of RSL depends on AI companies' willingness to adopt the system, which may require a shift in their approach to data acquisition [2][4].

The introduction of RSL marks a significant step towards balancing AI innovation with fair compensation for content creators. As Tim O'Reilly, CEO of O'Reilly Media, stated, 'RSL builds directly on the legacy of RSS, providing the missing licensing layer for the AI-first Internet' [3]. If successful, RSL could reshape the relationship between AI companies and content providers, potentially setting a new standard for ethical AI development and usage in the digital age [4][5].