2 Sources
[1]
AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
"This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem." AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, in some cases knocking their collections offline, according to a new survey published today. While the impact of AI bots on open collections has been reported anecdotally, the survey is the first attempt to measure the problem, which in the worst cases leaves valuable public resources unavailable to humans because the servers hosting them are swamped by scraper traffic.
[2]
AI is breaking the internet's memory
AI bots are quietly overwhelming the digital infrastructure behind our cultural memory. In early 2025, libraries, museums, and archives around the world began reporting mysterious traffic surges on their websites. The culprit? Automated bots scraping entire online collections to feed training datasets for large AI models. What started as a few isolated incidents is becoming a global pattern.

To investigate, the GLAM-E Lab (focused on Galleries, Libraries, Archives, and Museums) launched a survey that reached 43 institutions across North America, Europe, and Oceania. Its findings reveal a growing tension between open access and technical resilience in the face of AI-scale data extraction. Of the 43 institutions surveyed, 39 reported recent traffic spikes. Most had no idea what was happening until their servers slowed down or went offline entirely. When they dug deeper, they discovered that many of these requests came from bots, often linked to companies building training corpora for large AI models.

Unlike traditional search-engine crawlers, these bots don't operate gently or gradually. They arrive in dense, rapid waves, downloading everything, following every link, and ignoring signals like robots.txt. Their activity mimics a distributed denial-of-service attack, even if their intent is simply data collection.

Each GLAM institution has its own digital setup. Some run robust cloud architectures; others run legacy systems barely equipped to handle regular visitor loads. When bots strike, the impact can be wildly uneven: a national museum might absorb the spike, while a community archive could crash within minutes.

The analytics tools these institutions use weren't built to detect bots. Many respondents said they discovered the true source of the traffic only after breakdowns occurred. Some had mistaken bot visits for rising public interest until they realized those numbers couldn't be trusted.

One might assume that bots target only openly licensed content. The reality is blunter: bots do not care. Both open and restricted collections are scraped. Licensing signals aren't being read, let alone respected. That puts every digital collection online, no matter how carefully curated, at risk of exploitation and collapse.

This presents a dilemma. GLAM institutions exist to share culture and knowledge widely, but the same openness that serves the public also exposes them to industrial-scale scraping by AI developers, many of whom provide no attribution, compensation, or regard for infrastructure costs.

Institutions reported seeing bots arrive in swarms, often rotating IP addresses and spoofing user agents to avoid detection. Traffic would surge without warning, spike server CPU to 100%, and crash systems for hours or days. After grabbing what they needed, the bots would disappear until the next swarm. Some respondents described bots revisiting monthly; others saw increasing frequency, suggesting either growing demand or more actors entering the AI training space. In every case, the disruption was real, measurable, and costly.

Many GLAM teams deployed countermeasures: firewalls, IP blocks, geofencing, and bot-detection services like Cloudflare. Each solution has trade-offs. Blocking by geography might shut out legitimate researchers. User-agent filtering is easy for bad actors to circumvent. Some institutions considered login gates, but that conflicts with their public-access mission. The most effective countermeasures are also the most expensive: scaling up server capacity, migrating infrastructure, or integrating sophisticated traffic monitoring costs money, and cultural institutions often have none to spare.

One of the deeper questions raised by the report is philosophical. If bots now represent a significant share of traffic, do they count as users? Should institutions try to serve them, block them, or treat them as a new class of visitor?
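The burst-and-vanish swarm pattern described above is, at bottom, a rate-abuse problem, and per-client rate limiting is one of the cheaper mitigations available. Below is a minimal token-bucket sketch in Python; the class, the parameter values, and the in-memory per-IP dictionary are illustrative assumptions, not anything the surveyed institutions describe:

```python
import time

class TokenBucket:
    """Per-client token bucket: permits short bursts, caps the sustained rate."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP; a real deployment would also evict stale entries.
buckets: dict[str, TokenBucket] = {}

def check_request(ip: str, rate: float = 5.0, capacity: float = 10.0) -> bool:
    """Return True if this request should be served, False if throttled."""
    bucket = buckets.setdefault(ip, TokenBucket(rate, capacity))
    return bucket.allow()
```

A swarm that fires hundreds of requests per second exhausts its bucket almost immediately and gets throttled, while a human browsing a few pages a minute never notices the limiter.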
Many institutions said their visitor counts were inflated by bot traffic; once corrected, the real engagement metrics painted a very different picture. The familiar remedies are failing: updating robots.txt no longer works, reporting abuse gets mixed results, and adding login barriers risks excluding welcome visitors. Even identifying which bots are "good" (e.g., search engines) and which are "bad" (AI scrapers) is murky as the boundary between indexing and dataset collection blurs.

Some institutions are considering building APIs to serve bots more efficiently, but that assumes the bots will use them, which they likely won't. Others hope for legal protections, like those proposed in the EU's Digital Single Market directive, but enforcement is far from guaranteed.

This isn't just a technical challenge; it's a stress test for the values of openness and access in the digital age. The GLAM community, despite its global diversity, shares a strikingly unified ethic: culture should be freely accessible. But the infrastructure supporting that ethic wasn't designed for AI-scale extraction. If AI companies want to rely on the public internet as a training ground, they may need to support its maintenance. That could mean abiding by better standards, funding sustainable access programs, or respecting new opt-out protocols.
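Telling "good" bots from impostors is at least tractable for the major search engines, which publish a verification procedure: reverse-resolve the client IP, check the hostname's domain, then forward-resolve the hostname to confirm it maps back to the same IP (forward-confirmed reverse DNS). A sketch follows; the domain list is an illustrative subset, and the resolver callables are injectable only so the logic can be exercised without network access:

```python
import socket

# Domains under which some legitimate search crawlers resolve
# (illustrative subset; each engine documents its own).
GOOD_BOT_DOMAINS = (".googlebot.com.", ".google.com.", ".search.msn.com.")

def is_verified_crawler(ip: str,
                        reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                        forward=lambda host: socket.gethostbyname(host)) -> bool:
    """Forward-confirmed reverse DNS: a spoofed User-Agent cannot pass this,
    because the scraper does not control the search engine's DNS zones."""
    try:
        host = reverse(ip)
        hostname = host if host.endswith(".") else host + "."
        if not hostname.endswith(GOOD_BOT_DOMAINS):
            return False
        # The claimed hostname must resolve back to the same IP.
        return forward(host) == ip
    except OSError:
        return False
```

AI scrapers that rotate residential IPs fail both checks, which is precisely why institutions find the good/bad boundary murky: absence of verification proves nothing about intent.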
AI bots are overwhelming the servers of libraries, archives, museums, and galleries, causing disruptions and raising concerns about the sustainability of open access to cultural resources.
A recent survey by the GLAM-E Lab has revealed a growing crisis in the digital preservation of cultural heritage. AI scraping bots are overwhelming the servers of libraries, archives, museums, and galleries (GLAM institutions), causing significant disruptions and in some cases knocking entire collections offline. [1][2]

Of the 43 institutions surveyed across North America, Europe, and Oceania, 39 reported recent traffic spikes attributed to AI bots. These bots, often linked to companies building training corpora for large AI models, arrive in dense, rapid waves, downloading entire collections and ignoring traditional web-crawling etiquette. [2]
The impact varies widely depending on the institution's digital infrastructure. While some larger organizations can absorb the increased traffic, smaller community archives may crash within minutes of a bot attack. Many institutions only discovered the true source of the traffic after experiencing breakdowns, as their analytics tools were not designed to detect this type of bot activity. [2]
This situation presents a significant dilemma for GLAM institutions, whose mission is to share culture and knowledge widely. The same openness that serves the public also exposes them to industrial-scale scraping from AI developers, often without attribution, compensation, or regard for infrastructure costs. [2]
Institutions have reported bots arriving in swarms, rotating IP addresses, and spoofing user agents to avoid detection. These attacks can spike server CPU usage to 100% and crash systems for hours or days. [2]
Many GLAM teams have deployed various countermeasures, including firewalls, IP blocks, geofencing, and bot detection services. However, each solution comes with trade-offs. For example, blocking by geography might prevent legitimate researchers from accessing materials, while user agent filtering can be easily circumvented. [2]
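To make that trade-off concrete: user-agent filtering amounts to a substring blocklist like the sketch below. The crawler tokens are identifiers these operators actually publish, though the sample is incomplete and the helper name is hypothetical. Any scraper that simply sends a stock browser string sails straight through:

```python
# Published AI-crawler User-Agent tokens (a real but incomplete sample;
# OpenAI, Common Crawl, Anthropic, and ByteDance each document their own).
AI_UA_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def blocked_by_user_agent(user_agent: str) -> bool:
    """Reject requests whose User-Agent header matches a known AI crawler.
    Only effective against bots that identify themselves honestly."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_UA_TOKENS)
```

A self-identifying crawler string like "GPTBot/1.0" is caught; a scraper spoofing "Mozilla/5.0 (Windows NT 10.0; ...)" is indistinguishable from a human visitor, which is why institutions pair this with IP reputation or commercial bot detection.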
The most effective countermeasures, such as scaling up server capacity or integrating sophisticated traffic monitoring tools, are often prohibitively expensive for cultural institutions with limited budgets. [2]
This crisis raises deeper philosophical questions about the nature of digital access in the AI age. If bots now represent a significant share of traffic, should institutions try to serve them, block them, or treat them as a new class of visitor? The situation is testing the values of openness and access in the digital age, as the infrastructure supporting these ethics wasn't designed to handle AI-scale extraction. [2]

Source: Dataconomy
Some institutions are considering building APIs to serve bots more efficiently, while others are hoping for legal protections. However, enforcement of such measures is far from guaranteed. [2]
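As a sketch of what such an API could offer: a cursor-paginated bulk-export endpoint lets a client fetch an entire collection in a handful of cheap requests instead of crawling thousands of rendered HTML pages. The function, record shape, and parameter names below are all hypothetical, not taken from any institution's actual API:

```python
import json

# Hypothetical collection records; a real system would read from its catalogue database.
RECORDS = [{"id": i, "title": f"Item {i}"} for i in range(1, 251)]

def export_page(cursor: int = 0, page_size: int = 100) -> str:
    """Return one JSON page of records plus the cursor for the next page.
    next_cursor is None when the collection is exhausted, so a polite
    client knows exactly when to stop instead of re-crawling links."""
    page = RECORDS[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(RECORDS) else None
    return json.dumps({"records": page, "next_cursor": next_cursor})
```

The design choice here is the article's open question in miniature: the endpoint is far cheaper to serve than page-by-page scraping, but it only helps if the bots choose to use it.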
The GLAM community is calling for AI companies to support the maintenance of the public internet if they intend to use it as a training ground. This could involve abiding by better standards, funding sustainable access programs, or respecting new opt-out protocols. [2]

Source: 404 Media
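One such opt-out mechanism already exists: robots.txt rules targeting the user-agent tokens that several AI crawlers publish (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training). A minimal example is below, with the caveat the survey underscores: compliance is entirely voluntary, and many scrapers ignore it.

```text
# robots.txt — opt out of known AI training crawlers (voluntary compliance only)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```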
As this situation continues to evolve, it's clear that a balance must be struck between preserving open access to cultural resources and protecting the digital infrastructure that makes such access possible. The resolution of this crisis will likely shape the future of digital cultural preservation and AI development alike.
Summarized by Navi