News websites block Wayback Machine as AI scraping fears threaten digital preservation

Major news outlets including the New York Times and USA Today are blocking the Wayback Machine from archiving their content, citing fears that AI companies exploit the archive to train models. The move threatens the Internet Archive's ability to preserve the public record, even as some of these same publishers rely on the archive for their own investigative reporting.

Major News Organizations Block Wayback Machine Over AI Concerns

A growing number of news websites are blocking the Internet Archive's Wayback Machine, threatening one of the web's most vital historical preservation tools. According to research from Originality AI, 23 major news organizations now prevent the archive's web crawler from accessing their content; they are among 241 sites that have implemented such restrictions [1]. Publishers blocking the Internet Archive include prominent outlets such as the New York Times and USA Today [2].

Source: TechRadar

The decision stems from concerns that AI companies are training models on archived content without permission. New York Times spokesperson Graham James stated that "Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us" [1]. Publishers worry that while they can block AI scrapers from their own sites, AI firms can still reach their material through the Wayback Machine's extensive library of web history.
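The article does not say which mechanism these publishers use, but one common way for a site to express per-crawler blocks is a robots.txt file. The following is a minimal sketch, assuming a hypothetical robots.txt and illustrative user-agent tokens (`GPTBot`, `archive.org_bot`, and the made-up `ExampleReaderBot`); it uses Python's standard `urllib.robotparser` to show that rules apply crawler by crawler, which is why blocking an AI scraper on the site itself says nothing about copies held by an archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not taken from any real publisher.
# The user-agent tokens are assumptions chosen to mirror the scenario in the text.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each crawler name, whether the rules permit fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "archive.org_bot", "ExampleReaderBot"],
    "https://news.example/2026/story.html",  # placeholder URL
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Note that robots.txt is a voluntary convention: it tells cooperating crawlers what the site owner wants, but it does not technically prevent access, and it has no effect on pages an archive has already captured.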

The Irony of Blocking While Benefiting

The situation reveals a troubling contradiction. USA Today recently published an investigative report on US Immigration and Customs Enforcement's delayed disclosure of detainment policy impacts, research that relied extensively on the Wayback Machine [1]. Yet the outlet simultaneously blocks the archive from preserving its own content. Mark Graham, director of the Wayback Machine, highlighted this paradox: "They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access" [1].

The blocking isn't about readers circumventing paywalls, but rather the broader backlash against AI and content scraping. News organizations fear that archived versions of their articles are an easy source of training data for large language models (LLMs), potentially enabling copyright violation on a massive scale.

Threats to Preserving Historical Web Content and Public Accountability

The trend extends beyond traditional media. Reddit has also blocked the Wayback Machine's web crawler over the same AI concerns, while federal government websites have contributed to data loss by deleting content [2]. As more organizations restrict access, the Internet Archive's capacity to archive web pages and maintain an accurate public record faces serious erosion.

Graham warns that the consequences reach far beyond AI: "There's no question that the general locking-down of more and more of the public web is impacting society's ability to understand what's going on in our world" [1]. Third-party archives provide an incorruptible version of stories that can hold publishers accountable when content is revised or deleted after publication [2].

What Comes Next for Digital Preservation

More than 100 media workers have signed a petition titled "Journalists applaud the Internet Archive's role in preserving the public record," demonstrating support from within the industry [1]. Graham remains in talks with news organizations to find solutions that address AI scraping concerns while maintaining access for historical preservation [2].

The situation raises complex questions about copyright law, the rights of publishers, and society's need for transparent historical records. As AI continues reshaping the digital landscape, the balance struck between protecting intellectual property and maintaining the public record will determine whether future researchers, journalists, and citizens can access the web's history or face an increasingly locked-down internet where accountability becomes harder to enforce.

TheOutpost.ai

© 2026 Triveous Technologies Private Limited