2 Sources
[1]
AI could mean the end of the Wayback Machine, as news websites are increasingly blocking it to prevent content scraping
The Wayback Machine's very existence is threatened -- and this is about AI, not paywalls

* A growing number of major news sites are blocking the Wayback Machine
* That reportedly includes 23 organizations that are preventing their content from appearing in the archive
* This is happening due to fears that the Wayback Machine is being exploited for AI content scraping

The Wayback Machine is under serious threat (and not for the first time), as a growing number of major news websites appear to be blocking the archiving system. If you're not familiar with the Wayback Machine, it's run by the non-profit Internet Archive and is essentially a time machine that preserves a history of the web (and more besides). This can be vital for historical research, for example, or for monitoring changes to websites.

As Wired reports (via 9to5Mac), there's a growing trend of online news outlets blocking the web crawler that the Internet Archive uses to gather its snapshots. Some 23 big news sites are now doing so, according to Originality AI (which specializes in AI detection). That includes the New York Times (based on a Nieman Lab report) and USA Today, with Wired highlighting that the latter recently published a report on how US Immigration and Customs Enforcement delayed the disclosure of key information about the impact of its detainment policies. That piece used the Wayback Machine extensively in its research.

The irony of USA Today using this data in such a way, yet blocking the Wayback Machine from accessing its own content -- which could potentially keep the news site itself honest in the future -- isn't lost on Wayback Machine director Mark Graham. Graham told Wired: "They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access."
Of course, if more and more organizations block the Wayback Machine, its ability to keep a historical record of online content will be increasingly eroded.

Analysis: blame AI (again)

So why is this happening? This isn't about readers circumventing paywalled content using the Wayback Machine, in case you thought that was the issue at stake. Would it surprise you to learn that it's actually about AI, in a roundabout way? Of course it wouldn't, and in predictable fashion the Internet Archive seems to be caught up in the broad backlash against AI.

What these news organizations say they object to is not a historical record of their content being maintained, but the fact that this archive can be used by third-party AI firms to train their large language models (LLMs). As Wired points out, New York Times spokesperson Graham James said: "The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us."

In short, the worry for these companies is that even if they block such AI scraping themselves, it will still happen behind their backs via the Wayback Machine. It's not just major news outlets that have these worries, either, but also social media platforms, notably Reddit, which has blocked the Wayback Machine's web crawler over the exact same concerns. While there are other possible sources and ways of indirectly scraping news content, the Wayback Machine is the most obvious target for rogue AI operators, as it maintains such an extensive library of web history.

So this is a complex issue bound up in AI scraping and a whole lot of legal grey areas. However, the effect on what is an important resource for keeping a check on governments and media giants -- and holding them accountable for what was said in the past, or what has been deleted from the web entirely in some cases -- is clearly worrying.
Graham asserts: "There's no question that the general locking-down of more and more of the public web is impacting society's ability to understand what's going on in our world."

A petition entitled 'Journalists applaud the Internet Archive's role in preserving the public record' has been put together and sent off with over 100 signatures from working journalists. Meanwhile, dialogue between the Internet Archive and the news publishers remains ongoing, so hope of finding a workable solution isn't lost yet.
[2]
News outlets like NYT and USA Today are blocking the Internet Archive's Wayback Machine to prevent AI training models from using their content | Fortune
What some consider the digital library of Alexandria is in danger of losing valuable scrolls. Major media outlets are blocking the Internet Archive's Wayback Machine from saving web pages to prevent AI giants from training models on snapshots of old articles. Wired reported that 23 news organizations, including USA Today and the New York Times, are among the 241 sites denying the Internet Archive's web crawler access to their articles.

It's not personal -- some outlets still use the Archive in their reporting -- it's about the looming threat of AI: publishers can archive their own material, but a third party maintains a more incorruptible version of stories that can hold outlets accountable when a story is revised after publication.

Nothing new: Last year, Reddit barred the Wayback Machine from data scraping over similar AI concerns. The archive also lost a slew of information when federal government websites were deleted.

Still working: Wayback Machine director Mark Graham is reportedly in talks to regain access to the material, while more than 100 media workers signed a letter supporting the Wayback Machine. -- DL
Major news outlets including the New York Times and USA Today are blocking the Wayback Machine from archiving their content, citing fears that AI companies are exploiting the archive to train models. The move threatens the Internet Archive's ability to preserve the public record, even as some of these same publishers rely on the archive for their own investigative reporting.
A growing number of news websites are blocking the Internet Archive's Wayback Machine, threatening one of the web's most vital historical preservation tools. According to research from Originality AI, 23 major news organizations are now preventing the archive's web crawler from accessing their content, representing a significant portion of the 241 sites that have implemented such restrictions [1]. Among the publishers blocking the Internet Archive are prominent outlets like the New York Times and USA Today [2].
Source: TechRadar
The decision stems from concerns about AI training models using archived content without permission. New York Times spokesperson Graham James stated that "Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us" [1]. Publishers worry that while they can block AI scraping directly from their sites, third-party AI firms can still access their material through the Wayback Machine's extensive library of web history.

The situation reveals a troubling contradiction. USA Today recently published an investigative report on US Immigration and Customs Enforcement's delayed disclosure of detainment policy impacts, research that relied extensively on the Wayback Machine [1]. Yet the outlet simultaneously blocks the archive from preserving its own content. Mark Graham, director of the Wayback Machine, highlighted this paradox: "They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access" [1].

This isn't about readers circumventing paywalls, but rather the broader backlash against AI and content scraping. News organizations fear that archived versions of their articles provide an easy target for large language models (LLMs) seeking training data, potentially enabling copyright violation on a massive scale.
The trend extends beyond traditional media. Reddit has also blocked the Wayback Machine's web crawler over identical AI concerns, while the deletion of federal government websites has led to further data loss [2]. As more organizations restrict access, the Internet Archive's capacity for archiving web pages and maintaining an accurate public record faces serious erosion.

Graham warns that the consequences reach far beyond AI: "There's no question that the general locking-down of more and more of the public web is impacting society's ability to understand what's going on in our world" [1]. Third-party archives provide an incorruptible version of stories that can hold publishers accountable when content is revised or deleted after publication [2].
More than 100 media workers have signed a petition titled "Journalists applaud the Internet Archive's role in preserving the public record," demonstrating support from within the industry [1]. Graham remains in talks with news organizations to find solutions that address AI scraping concerns while maintaining access for historical preservation [2].

The situation raises complex questions about copyright law, the rights of publishers, and society's need for transparent historical records. As AI continues reshaping the digital landscape, the balance struck between protecting intellectual property and maintaining the public record will determine whether future researchers, journalists, and citizens can access the web's history, or face an increasingly locked-down internet where accountability becomes harder to enforce.
Summarized by Navi