3 Sources
[1]
News Organizations Push Back Against Web Archive Used For AI
Major news organizations, including CNN, NBC and USA Today, have joined an effort to curb the storage of their content in a web archive used by artificial intelligence companies for training chatbots. Those outlets are part of a group of 20 publishers who have opted out of having their content saved in an online repository maintained by the nonprofit Common Crawl, according to a letter reviewed by Bloomberg.

The News/Media Alliance, which represents newspapers and magazines, sent the letter Wednesday to Common Crawl urging it to honor publishers' requests to remove content from dozens of their websites and prohibit the unauthorized use of their work, including for AI purposes. "AI developers, including some of the most powerful tech companies on earth, are in fact using Common Crawl's content to power their commercial AI tools and products - causing significant harm to our members," the alliance wrote. Common Crawl declined to comment on the letter.

Danielle Coffey, president and chief executive officer of the News/Media Alliance, said she found out about the rampant use of Common Crawl data in AI training about three years ago. "We do view this type of use of our content without permission as a copyright violation," she said in an interview.

Founded in 2007, Common Crawl uses a bot to traverse the web and download data in order to create a digital archive that anyone can access. Some companies, including OpenAI, Google and Meta Platforms Inc., have used this trove of content to develop chatbots like ChatGPT. Common Crawl has also received donations from AI companies, including OpenAI and Anthropic PBC.

News outlets have been concerned that chatbots and AI-powered search results will steal their content, cut into web traffic and cost them valuable advertising dollars. New York Times Co. sued Microsoft Corp. and OpenAI over alleged copyright infringement that it says caused billions of dollars in damages. The newspaper claimed that its news articles were scraped from the web and copied almost verbatim in chatbot responses, noting that Common Crawl contributed data for AI training. Other companies, including Axel Springer, Condé Nast and Hearst, have signed lucrative licensing deals that allow companies to use their content in AI tools.

Common Crawl established an opt-out registry that allows publishers to request that their content be excluded from its web crawls. However, an investigation from the Atlantic magazine in November claimed that Common Crawl doesn't always honor these requests and that it circumvents website paywalls to scrape content, a practice the nonprofit says it doesn't do. Common Crawl denied these allegations in a blog post and said it "has always operated in good faith with publishers."
[2]
News publishers are blocking the Internet Archive's Wayback Machine
The New York Times, CNN, USA Today, and The Guardian are among at least 241 news organisations across nine countries that have moved to restrict the Archive's crawlers, a decision the Archive's own director has described as making the Archive 'collateral damage' in a war that is not really about it.

The Internet Archive has preserved more than one trillion web pages since 1996. Courts cite it. Journalists use it to prove articles were edited after publication. Historians treat it as a primary source. It is, by most measures, one of the most significant public information infrastructure projects of the internet era. And it is now being systematically blocked by the news publishers whose work it has preserved, because of a problem those publishers are genuinely not wrong about: AI companies are using archived news content to train models without permission or payment.

According to an analysis by AI-detection startup Originality AI, 23 major news publications are blocking ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine. In total, 241 news sites across nine countries explicitly disallow at least one of the Archive's four crawling bots. USA Today Co., the largest newspaper publisher in the US, accounts for a large share of the blocked sites, effectively removing hundreds of local publications from the historical record. The New York Times implemented what Wayback Machine director Mark Graham described as a 'hard block' starting in late 2025.

The news organisations' argument is coherent even if its consequences are troubling. AI companies training large language models need vast quantities of high-quality text. Archived news content is exactly that: structured, dated, attributed, high-quality writing accumulated over decades. The Internet Archive's Wayback Machine makes enormous quantities of that content accessible via API and URL interface, an ideal source for model training pipelines. A 2023 Washington Post analysis found that data from the Internet Archive had appeared in major AI training datasets. For publishers already engaged in copyright lawsuits against OpenAI, Perplexity, and others, the Archive is a gap in their defences.

"The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us," said Graham James, a Times spokesperson. "The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission."

The Guardian, which has been more cautious, limited rather than fully blocked the Archive's access after its own logs revealed the Archive was a frequent crawler. Robert Hahn, head of business affairs at The Guardian, expressed particular concern about the Archive's APIs. "A lot of these AI businesses are looking for readily available, structured databases of content," he said. "The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP."

Mark Graham, the Wayback Machine's director, has been consistent in calling this situation exactly what it is. "We are collateral damage," he said. The Archive has taken steps of its own: it rate-limits bulk downloads, blocks or prevents bulk downloading of certain sites' material, and maintains controls to limit large-scale automated extraction.
Graham argues this means the publishers' rationale for blocking the Archive's crawlers is "unfounded": the risk comes from AI companies accessing archived material through the Archive's interfaces, which the Archive itself controls and limits, not from the Archive crawling and preserving the content in the first place. The Archive has also been actively in dialogue with publishers to find workable arrangements. The Guardian itself said it has been "working directly with the Internet Archive" to implement its access limits, rather than imposing a unilateral hard block. But the Archive's position, that it is a neutral preservation institution and not an AI training pipeline, does not fully resolve the publishers' concern that third parties can access its data regardless of the Archive's own intentions.

The problem with the publishers' response is that the instrument they are using, blocking the Archive's crawlers, has consequences that extend far beyond AI companies. When a news article is no longer archived, it becomes editable without accountability. Publishers can and do quietly amend stories after publication: correcting errors, softening claims, removing quotes. The Wayback Machine has been the primary tool journalists use to document those changes. The Electronic Frontier Foundation's Joe Mullin put the stakes bluntly: "The Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake."

Wikipedia links to over 2.6 million news articles preserved by the Wayback Machine across 249 languages. Courts have used archived pages as evidence. Journalists have used them to prove government agencies changed official statements after publication. USA Today Co.'s decision to block access has effectively removed hundreds of local newspapers from the historical record, at a moment when local journalism is already in crisis and every preserved article represents documentation that may not exist anywhere else.

A petition organised by Fight for the Future, signed by over 100 working journalists, has pushed back against the blocking trend, describing the Wayback Machine as a tool that "preserves the public record at a time where many major media outlets are questioning whether to allow it to do so." Nieman Lab reported on the petition in mid-April; the dispute is now escalating rather than resolving.

Yet the Wayback Machine dispute is a compressed version of a structural problem that runs through the entire AI copyright debate. The institutions designed to serve the public interest, a digital library, open web standards, publicly accessible archives, are becoming the path of least resistance for AI companies seeking training data, because the AI companies' direct scraping is increasingly being blocked, litigated, and metered. The result is that the more publishers and rights holders resist AI training directly, the more pressure accumulates on the public infrastructure they cannot control. As Michael Nelson, a computer scientist at Old Dominion University, told Nieman Lab: "Common Crawl and Internet Archive are widely considered to be the 'good guys' and are used by 'the bad guys' like OpenAI. In everyone's aversion to not be controlled by LLMs, I think the good guys are collateral damage."

The EFF concludes that the right response is not to block the Archive but to sue the AI companies directly.
"There are real disputes over AI training that must be resolved in courts." The publishers have, in fact, done exactly that: the Times' lawsuit against OpenAI is proceeding. But they appear to have concluded that waiting for courts to resolve those disputes is too slow, and are taking the faster, blunter option of blocking the Archive in the meantime.
[3]
Why news publishers are blocking AI from accessing internet archives
AI companies using archived news content could be a major violation of copyright laws, especially in the midst of active lawsuits against companies such as OpenAI and Perplexity.

Around 245 global news organisations across nine countries are attempting to block the Internet Archive's crawlers. These are automated software bots that capture, display and archive content from web pages in the Internet Archive's public-facing interface, the Wayback Machine. The Archive holds over one trillion web pages dating all the way back to 1996, making it one of the biggest collective public information resources in the world. This includes past articles from major news organisations such as CNN, The New York Times, The Guardian, and USA Today. These web pages are used for a variety of purposes, for example, as primary sources for historians, or to prove changes after publication.

Several news organisations are now pushing to block the crawlers because AI companies are using the contents of the Archive to train large language models (LLMs) without offering fair payment or acquiring permission. More than 20 major news organisations already block ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine, according to an analysis by AI-detection company Originality AI, and at least one of the Archive's four crawling bots is blocked by 241 global news sites. A major chunk of these blocked sites is owned by USA Today Co, the US's biggest newspaper publisher, meaning that hundreds of local publications have been practically removed from the historical record.

Archival news content provides massive quantities of high-quality text and images for training large-scale AI models to produce more human-like writing, and it has the added advantage of being already structured, attributed and dated. This content is available through a URL interface and an API, a bridge that lets different software systems communicate and request data, making it even easier for AI companies to access archived material and train models on it. Much of the Internet Archive's data has already been found in key AI-training datasets. This accessibility is a major weakness for news organisations, which are already suing AI companies such as Perplexity and OpenAI for potential copyright violations.

"The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us," Graham James, a spokesperson for The New York Times, said, as cited by The Next Web. "The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission."

Other organisations, such as The Guardian, have taken a more conservative approach by limiting, rather than completely blocking, the Archive's access. The Wayback Machine's director, Mark Graham, has maintained that the Archive is merely "collateral damage" and that the real culprits are the AI companies which access past content through the Archive's interfaces. The Archive has taken measures of its own to limit this, including preventing large downloads of some sites' material and limiting automated extraction in certain cases.

Graham highlighted that the Archive functions as a key method of preservation. Without it, articles which are not archived can be edited without accountability.
Such edits can be anything from changing or removing quotes to amending mistakes or revising claims and official statements. Currently, these changes are tracked by the Wayback Machine. This has led some news organisations to work with the Internet Archive on acceptable compromises or workarounds that involve limiting access rather than hard blocks. Similarly, the non-profit digital rights advocacy group Fight for the Future has launched a petition, already signed by more than 100 working journalists, to protest against the blocking, at a time when public records and history are increasingly contested.
Over 240 news organizations across nine countries are blocking web archives like Common Crawl and the Internet Archive's Wayback Machine to prevent AI companies from using their content without permission or compensation. Major outlets including The New York Times, CNN, and USA Today are restricting access, citing copyright violations as AI firms use archived news content to train large language models. The move raises concerns about preserving public records and historical accountability.
Major news publishers are mounting a coordinated effort to block web archives that AI companies exploit to train large language models without permission or compensation. Over 240 news organizations across nine countries, including The New York Times, CNN, USA Today, and The Guardian, are now restricting access to their content stored in digital archives [2][3]. The News/Media Alliance, representing 20 publishers, sent a letter to Common Crawl demanding the nonprofit honor opt-out requests and prohibit unauthorized use of their work for AI training purposes [1].
Danielle Coffey, president and CEO of the News/Media Alliance, described the situation bluntly: "We do view this type of use of our content without permission as a copyright violation" [1]. The alliance discovered the widespread use of Common Crawl data in AI training approximately three years ago, raising immediate concerns about copyright infringement.
Common Crawl, founded in 2007, operates a bot that traverses the web to create a digital archive accessible to anyone [1]. AI companies including OpenAI, Google, and Meta Platforms have tapped this trove of archived news content to develop chatbots like ChatGPT. The nonprofit has received donations from OpenAI and Anthropic, raising questions about potential conflicts of interest [1].
Archived news content offers AI developers exactly what they need: structured, dated, attributed, high-quality writing accumulated over decades [2]. The Internet Archive's Wayback Machine, which has preserved more than one trillion web pages since 1996, makes enormous quantities of content accessible via API and URL interface, an ideal source for model training pipelines [2]. A 2023 Washington Post analysis confirmed that data from the Internet Archive appeared in major AI training datasets [2].
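To make the "API and URL interface" point concrete, here is a minimal sketch of the kind of programmatic access the Wayback Machine exposes. It queries the Archive's public availability endpoint (archive.org/wayback/available) for the most recent snapshot of a page; the target article URL is hypothetical, and this is an illustration of the access pattern, not any AI company's actual pipeline.

    import json
    from urllib.request import urlopen

    # Ask the Wayback Machine's availability API for the closest
    # archived snapshot of a (hypothetical) article URL.
    endpoint = ("https://archive.org/wayback/available"
                "?url=example.com/2024/some-article")
    with urlopen(endpoint) as resp:
        data = json.load(resp)

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        # snapshot["url"] points at the archived copy; snapshot["timestamp"]
        # is the capture time in YYYYMMDDhhmmss form.
        print(snapshot["url"], snapshot["timestamp"])
    else:
        print("No archived snapshot found.")

Because responses like this are structured JSON keyed by URL, a pipeline can iterate over article URLs at scale, which is precisely the bulk-extraction pattern the Archive now rate-limits.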
News outlets worry that chatbots and AI-powered search results will steal their content, reduce web traffic, and cost them valuable advertising revenue. The New York Times sued Microsoft and OpenAI over alleged copyright infringement causing billions of dollars in damages, claiming news articles were scraped and copied almost verbatim in chatbot responses [1]. Graham James, a Times spokesperson, stated: "The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us" [2]. For publishers engaged in active lawsuits against OpenAI, Perplexity, and others, web archives represent a significant gap in their defenses [2].
Meanwhile, some companies including Axel Springer, Condé Nast, and Hearst have signed lucrative licensing deals that allow controlled use of their content in AI tools [1].

At least 23 major news publications are blocking ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine, according to analysis by AI-detection startup Originality AI [2]. USA Today Co., the largest newspaper publisher in the US, accounts for a substantial share of blocked sites, effectively removing hundreds of local publications from historical records [2][3].
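Mechanically, this blocking happens through the Robots Exclusion Protocol: a publisher lists a crawler's user-agent token in its site's robots.txt file with a Disallow rule, which is what the Originality AI analysis looked for. A minimal sketch, assuming ia_archiver as the Archive's token and CCBot as Common Crawl's (the subscriber path is hypothetical):

    # robots.txt at the root of a publisher's site
    # Block the Internet Archive's Wayback Machine crawler entirely.
    User-agent: ia_archiver
    Disallow: /

    # Block Common Crawl's crawler as well.
    User-agent: CCBot
    Disallow: /

    # Everyone else may still crawl public pages.
    User-agent: *
    Disallow: /subscribers/

Note that robots.txt is honoured voluntarily by well-behaved crawlers; it is a request, not an enforcement mechanism, which is part of why publishers also pursue opt-out letters and litigation.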
Mark Graham, the Wayback Machine's director, characterized the situation plainly: "We are collateral damage" [2][3]. The Archive has implemented its own protective measures, including rate-limiting bulk downloads and blocking large-scale automated extraction from certain sites [2][3]. Graham argues that publishers' rationale for blocking crawlers is "unfounded," since the risk comes from AI companies accessing archived material through interfaces the Archive controls, not from preservation activities themselves [2].
The consequences of blocking web crawlers extend far beyond preventing AI training. When news articles are no longer archived, they become editable without accountability. Publishers can quietly amend stories after publication, correcting errors, softening claims, or removing quotes, and the Wayback Machine has been the primary tool journalists use to document those changes [2][3]. Courts cite the Archive, historians treat it as a primary source, and journalists rely on it for preservation of the public record [2].
The Electronic Frontier Foundation's Joe Mullin framed the stakes clearly: "The Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake" [2]. Fight for the Future has launched a petition, already signed by more than 100 working journalists, protesting the blocking at a time when public records and history are increasingly contested [3].
Some organizations are taking more measured approaches. The Guardian limited rather than fully blocked Archive access after discovering the Archive was a frequent crawler of its site [2]. Robert Hahn, head of business affairs at The Guardian, expressed particular concern about APIs: "A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP" [2].
The Archive has been actively working with publishers to find acceptable compromises involving limited access rather than hard blocks [2][3]. Common Crawl established an opt-out registry allowing publishers to request content exclusion from web crawls, though a November investigation by The Atlantic claimed Common Crawl doesn't always honor these opt-out requests and circumvents paywalls, allegations the nonprofit denied [1]. As legal battles continue and AI capabilities expand, publishers face difficult choices between protecting their intellectual property and maintaining historical records that serve the public interest.