2 Sources
[1]
Internet archival sites struggling to preserve the internet because of skyrocketing hard drive prices due to the AI boom -- Wayback Machine and Wikimedia punished by stratospheric storage pricing and stricter anti-scraping measures blocking the wrong bots
The internet is getting harder to archive because the AI boom has caused a storage crisis, with both NAND and mechanical drives facing shortages. The same large-capacity HDDs now cost up to 3x more due to shriveled production capacities that have otherwise been entirely booked out by hyperscalers. These rising prices have made it difficult to preserve data at the usual rate across the industry, as reported by 404 Media. The Internet Archive, whose mission is to provide "universal access to all knowledge," is one of the organizations affected by this crisis. It holds around 210 petabytes of archives, with another 100 terabytes added every day to collections like the Wayback Machine. Amidst the AI boom, maintaining it has become "a very real issue costing us time and money," founder Brewster Kahle told 404 Media. The 28-30TB hard drives ideal for the job are simply out of stock or available at a grossly inflated price. Fortunately, Internet Archive has active donors and a passionate community of bit-rot fighters that help alleviate some of these concerns, but only by finding workarounds. The organization is also trying to source drives from manufacturers, but they're likely busy with backorders instead. Wikimedia Foundation, the parent non-profit behind Wikipedia, shares similar sentiments, explaining how maintaining over 65 million articles already requires careful budget allocation, which the current turbulence has only exacerbated. A spokesperson told 404 Media that it sees "the primary impact in the purchase of memory and hard drives but also in terms of lead times on server deliveries and our capacity to place future orders." Beyond the shortage, the AI boom has managed to affect archival efforts in another way that's likely not reversible: scraping. LLMs are trained on huge chunks of data often acquired from the internet, sometimes even illegally. 
As you'd expect, a lot of sites don't appreciate being randomly scraped to become part of some AI's learning material, so they've put up countermeasures that prevent companies from doing so. Archiving the internet shares the same first step -- it needs to extract information in order to preserve it -- but website operators have been increasingly blocking such efforts. Bots that would otherwise scrape a site just to produce a snapshot for educational purposes are now being treated the same way as a bot looking to gather information for artificial intelligence, unintentionally or not. People in the community who contribute to preservation efforts are also having to think twice about what to preserve. Since hard drives are so expensive now, even enthusiasts in the r/DataHoarders subreddit are doom-posting about how they've stopped archiving entirely, waiting for prices to level out. You can occasionally find deals, but seeing a large-capacity drive at MSRP has become nearly impossible. Those are regular individuals struggling to keep up with rising costs, while the larger non-profits are still managing to scrape by (pun intended). But what about the players in the middle? End of Term Archive, dedicated to archiving government websites between different administrations, is holding onto hope that things will settle down by the time they need to upgrade.
[2]
The Wayback Machine faces another threat from AI -- ridiculously expensive hard drive prices
AI's component black hole threatens to pull in another victim
* The Wayback Machine is under threat from AI once more
* The AI boom has tripled the price of the large hard disks needed for this expansive archive of the web
* This is a further danger posed to the Wayback Machine, which is also in trouble due to news sites blocking its web crawler, which is again due to AI
It's an increasingly desperate time for those trying to keep a record of the history of the web, as AI is again proving a serious stumbling block to the efforts made by the likes of the Internet Archive -- and this time it's about soaring hard drive prices. You may recall that last month, we covered another angle on the difficulties AI has been causing the Internet Archive's Wayback Machine. This is the non-profit organization's history of the web, and there's a problem in that, as part of measures designed to foil AI scraping their content, online news sites are increasingly blocking the web crawler the Internet Archive uses to compile the snapshots of web pages that comprise the archive. And now, 404 Media reports (via Tom's Hardware) that the Internet Archive is suffering due to the hard drive shortage brought on by AI (as more large drives are needed in data centers for AI workloads). Yes, the AI boom is not just about LLMs (Large Language Models) eating your RAM and SSDs, but also hard drives (as well as indirect effects on other components). The huge hard disks -- on the order of 30TB -- that the Internet Archive needs to host the Wayback Machine's historical record are now up to three times more expensive, or indeed completely out of stock. In this way, the AI boom is now a "very real issue costing us time and money," the Internet Archive's founder Brewster Kahle commented to 404 Media. With some 210 petabytes (210,000TB) of web page snapshots in its library, which is expanding by 100TB daily, you can appreciate the scope of the web archiving that's going on here.
Wikipedia's parent non-profit, the Wikimedia Foundation, is reportedly facing similar struggles, as you'd imagine. It has some 65 million articles to host, which takes up a lot of drive space. A Wikimedia Foundation spokesperson told 404 Media that the main problems are the "purchase of memory and hard drives", but also lead times on server deliveries.
Analysis: workarounds aplenty -- but what about tape?
So, is the Wayback Machine really in danger? Are we going to see the wheels start to come off the 'living history of the internet'? Well, there's no immediate peril, as apparently donors and the community around the Wayback Machine are pulling together to work around the issue of spiralling drive costs. Still, this is clearly a concern going forward -- and the blocking of the Internet Archive's web crawler is even more so. The problem there is that the news sites are blocking AI scraping, but those blocks can be circumvented if the owner of the AI targets the content via the Wayback Machine instead. It's a thorny issue, but talks are ongoing, and hopefully both sides can come to some kind of resolution. And on the drive front, if you're wondering why the Internet Archive can't switch to tape as a storage medium, the catch there is that it's a 'living' archive of the web -- as in it's online, for people to access those web page snapshots on demand. As such, hard drives are needed for that access to be responsive. Tape simply isn't up to snuff performance-wise in this case. The Internet Archive does use tape, mind, for longer-term backups of content, but it's only part of the puzzle in that respect. Hard drives are vital for the actual day-to-day functioning of the Wayback Machine as we know it, in terms of being able to quickly serve users the content they need online.
The AI boom has triggered a storage crisis that threatens internet archiving efforts. Hard drive prices have surged to as much as three times their previous level, and the 28-30TB drives archives rely on are either out of stock or sold at grossly inflated prices. The Internet Archive, which stores around 210 petabytes and adds 100 terabytes daily, now faces mounting costs, while anti-scraping measures designed to block AI bots are blocking archival crawlers too.

The AI boom has created an unexpected casualty: organizations dedicated to internet archiving are struggling to preserve digital history as hard drive prices skyrocket. The Internet Archive, home to the Wayback Machine and custodian of approximately 210 petabytes of data, now confronts what founder Brewster Kahle describes as "a very real issue costing us time and money." [1] The organization adds another 100 terabytes to its collections daily, making the current hard drive shortage particularly acute.

Both NAND flash and mechanical hard drives face shortages as hyperscalers book out production capacity for AI data centers. The 28-30TB hard drives essential for preserving digital content now cost up to three times their previous price, when they are available at all. [2] This stratospheric storage pricing forces archival organizations to make difficult choices about what content they can afford to preserve and at what pace.

The Wikimedia Foundation, which maintains Wikipedia and its more than 65 million articles, faces similar pressures from the storage crisis. A spokesperson explained that the organization sees "the primary impact in the purchase of memory and hard drives but also in terms of lead times on server deliveries and our capacity to place future orders." [1] Maintaining that many articles already required careful budget allocation, and the current market turbulence has only made it harder.

The Internet Archive is trying to source drives directly from manufacturers, but those suppliers are likely busy fulfilling backorders. While the organization benefits from active donors and a community committed to fighting digital decay, these supporters can only provide workarounds rather than systemic solutions. Finding a large-capacity drive at the manufacturer's suggested retail price (MSRP) has become nearly impossible.
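To put the reported figures in perspective, the short Python sketch below turns them into drive counts. The archive size, daily growth, drive capacity and price multiplier come from the reporting; everything else is an assumption made for illustration. In particular, the base price per drive is a hypothetical placeholder, and the calculation counts a single copy of the data with no redundancy, spares or filesystem overhead, so real requirements would be higher.

```python
import math

# Figures taken from the reporting.
ARCHIVE_PB = 210        # approximate size of the Internet Archive's holdings
DAILY_GROWTH_TB = 100   # data added per day
DRIVE_TB = 30           # upper end of the 28-30TB drives mentioned
PRICE_MULTIPLIER = 3    # "up to three times" the previous price

# Assumptions for illustration only (not from the sources).
TB_PER_PB = 1000        # decimal units, so 210 PB = 210,000 TB
BASE_PRICE_USD = 500    # HYPOTHETICAL pre-shortage price per drive


def drives_needed(capacity_tb: float, drive_tb: float = DRIVE_TB) -> int:
    """Whole drives needed to hold capacity_tb, single copy, no overhead."""
    return math.ceil(capacity_tb / drive_tb)


def yearly_growth_cost(price_per_drive: float) -> float:
    """Cost of the drives needed to absorb one year of growth."""
    return drives_needed(DAILY_GROWTH_TB * 365) * price_per_drive


if __name__ == "__main__":
    print("Drives for the existing archive:", drives_needed(ARCHIVE_PB * TB_PER_PB))
    print("Drives for one year of growth:", drives_needed(DAILY_GROWTH_TB * 365))
    print("Yearly cost at base price (USD):", yearly_growth_cost(BASE_PRICE_USD))
    print("Yearly cost at 3x price (USD):",
          yearly_growth_cost(BASE_PRICE_USD * PRICE_MULTIPLIER))
```

Under these assumptions, a single copy of the existing archive fills 7,000 drives and each year of growth needs about 1,217 more. Whatever the true base price is, a tripling multiplies the cost of that yearly expansion by three.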
Beyond the hard drive shortage, the AI boom threatens internet archiving through another mechanism: anti-scraping measures. As LLMs require massive datasets often acquired through data scraping—sometimes illegally—websites have implemented countermeasures to prevent unauthorized extraction of their content. These protective barriers don't distinguish between web crawler bots gathering information for AI training and those creating snapshots for educational purposes and digital preservation.
Blocking archival bots has become increasingly common as website operators treat all automated scraping with suspicion. The Wayback Machine's web crawler, designed to compile historical snapshots of web pages, now faces the same barriers erected against AI companies. [2] This creates a particularly thorny problem: AI companies can potentially circumvent blocks by accessing content through the Wayback Machine itself, making news sites wary of allowing any archival access.
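The sources do not say which mechanisms sites use to block crawlers. As one illustration, the sketch below uses robots.txt rules, a common way to signal crawler permissions, together with Python's standard `urllib.robotparser` module. The bot names `ExampleArchiveBot` and `ExampleAIBot` and the URL are hypothetical placeholders, not real crawler identifiers. The sketch shows how a blanket rule shuts out an archival crawler along with everything else, while a targeted rule would not.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that blocks only one named AI crawler.
TARGETED_BLOCK = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://news.example/story"  # placeholder URL
    for name, rules in [("blanket", BLANKET_BLOCK), ("targeted", TARGETED_BLOCK)]:
        for bot in ("ExampleArchiveBot", "ExampleAIBot"):
            print(f"{name:8} {bot:18} allowed={allowed(rules, bot, url)}")
```

With the blanket rule both bots are refused. With the targeted rule only the AI crawler is refused and the archival crawler can still take its snapshot. Targeted rules only help if the operator knows, and chooses to exempt, the archival crawler, and they do nothing about the concern that archived copies could themselves be scraped.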
The impact extends beyond major non-profits to individual contributors who support preserving digital content. Members of communities like the r/DataHoarders subreddit report scaling back or entirely stopping their archival efforts, waiting for prices to stabilize. While occasional deals surface, the consistent availability of affordable, large-capacity storage has evaporated. Organizations like the End of Term Archive, which documents government websites between different administrations, hold onto hope that market conditions will improve before their next upgrade cycle.
The Internet Archive does utilize tape storage for longer-term backups, but this medium cannot replace hard drives for the "living archive" that users access on demand. Tape storage lacks the performance characteristics necessary for responsive information access, making HDDs essential for day-to-day operations. As talks continue between archival organizations and content providers about resolving the web crawler blocking issue, the storage crisis represents a more immediate financial burden that threatens the pace and scope of knowledge preservation efforts worldwide.
Summarized by Navi