On Wed, 2 Apr, 4:03 PM UTC
7 Sources
[1]
AI bots strain Wikimedia as bandwidth surges 50%
On Tuesday, the Wikimedia Foundation announced that relentless AI scraping is putting strain on Wikipedia's servers. Automated bots seeking AI model training data for LLMs have been vacuuming up terabytes of data, growing the foundation's bandwidth used for downloading multimedia content by 50 percent since January 2024. It's a scenario familiar across the free and open source software (FOSS) community, as we've previously detailed.
The Foundation hosts not only Wikipedia but also platforms like Wikimedia Commons, which offers 144 million media files under open licenses. For decades, this content has powered everything from search results to school projects. But since early 2024, AI companies have dramatically increased automated scraping through direct crawling, APIs, and bulk downloads to feed their hungry AI models. This exponential growth in non-human traffic has imposed steep technical and financial costs -- often without the attribution that helps sustain Wikimedia's volunteer ecosystem.
The impact isn't theoretical. The foundation says that when former US President Jimmy Carter died in December 2024, his Wikipedia page predictably drew millions of views. But the real stress came when users simultaneously streamed a 1.5-hour video of a 1980 debate from Wikimedia Commons. The surge doubled Wikimedia's normal network traffic, temporarily maxing out several of its Internet connections. Wikimedia engineers quickly rerouted traffic to reduce congestion, but the event revealed a deeper problem: The baseline bandwidth had already been consumed largely by bots scraping media at scale.
This behavior is increasingly familiar across the FOSS world. Fedora's Pagure repository blocked all traffic from Brazil after similar scraping incidents covered by Ars Technica. GNOME's GitLab instance implemented proof-of-work challenges to filter excessive bot access. Read the Docs dramatically cut its bandwidth costs after blocking AI crawlers.
Wikimedia's internal data explains why this kind of traffic is so costly for open projects. Unlike humans, who tend to view popular and frequently cached articles, bots crawl obscure and less-accessed pages, forcing Wikimedia's core datacenters to serve them directly. Caching systems designed for predictable, human browsing behavior don't work when bots are reading the entire archive indiscriminately. As a result, Wikimedia found that bots account for 65 percent of the most expensive requests to its core infrastructure despite making up just 35 percent of total pageviews. This asymmetry is a key technical insight: The cost of a bot request is far higher than a human one, and it adds up fast.
Crawlers that evade detection
Making the situation more difficult, many AI-focused crawlers do not play by established rules. Some ignore robots.txt directives. Others spoof browser user agents to disguise themselves as human visitors. Some even rotate through residential IP addresses to avoid blocking -- tactics that have become common enough to force individual developers like Xe Iaso to adopt drastic protective measures for their code repositories.
This leaves Wikimedia's Site Reliability team in a perpetual state of defense. Every hour spent rate-limiting bots or mitigating traffic surges is time not spent supporting Wikimedia's contributors, users, or technical improvements. And it's not just content platforms under strain.
Developer infrastructure, like Wikimedia's code review tools and bug trackers, is also frequently hit by scrapers, further diverting attention and resources.
These problems mirror others in the AI scraping ecosystem. Curl developer Daniel Stenberg has detailed how fake, AI-generated bug reports are wasting human time. SourceHut's Drew DeVault has highlighted how bots hammer endpoints like git logs, far beyond what human developers would ever need.
Across the Internet, open platforms are experimenting with technical solutions: proof-of-work challenges, slow-response tarpits (like Nepenthes), collaborative crawler blocklists (like "ai.robots.txt"), and commercial tools like Cloudflare's AI Labyrinth. These approaches address the technical mismatch between infrastructure designed for human readers and the industrial-scale demands of AI training.
Open commons at risk
Wikimedia acknowledges the importance of providing "knowledge as a service," and its content is indeed freely licensed. But as the Foundation states plainly, "Our content is free, our infrastructure is not." The organization is now focusing on systemic approaches to this issue under a new initiative, WE5: Responsible Use of Infrastructure. It raises critical questions about guiding developers toward less resource-intensive access methods and establishing sustainable boundaries while preserving openness.
The challenge lies in bridging two worlds: open knowledge repositories and commercial AI development. Many companies rely on open knowledge to train commercial models but don't contribute to the infrastructure making that knowledge accessible. This creates a technical imbalance that threatens the sustainability of community-run platforms. Better coordination between AI developers and resource providers could potentially resolve these issues through dedicated APIs, shared infrastructure funding, or more efficient access patterns. Without such practical collaboration, the platforms that have enabled AI advancement may struggle to maintain reliable service.
Wikimedia's warning is clear: Freedom of access does not mean freedom from consequences.
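As a rough illustration of how a collaborative crawler blocklist in the spirit of "ai.robots.txt" gets applied in practice, the Python sketch below matches the User-Agent header of an incoming request against a short denylist. The agent names and the toy handler are simplified placeholders rather than any platform's real configuration, and -- as the article notes -- spoofed user agents slip straight past a check like this.

    # Minimal sketch: applying a community-maintained crawler denylist by User-Agent.
    # The agent names below are illustrative; real lists such as "ai.robots.txt"
    # are far longer and change frequently, and spoofed agents evade this check.

    BLOCKED_AGENT_SUBSTRINGS = [
        "GPTBot",       # example entries only
        "CCBot",
        "ClaudeBot",
        "Bytespider",
    ]

    def is_blocked_user_agent(user_agent: str) -> bool:
        """Return True if the request's User-Agent matches a denylisted crawler."""
        ua = user_agent.lower()
        return any(name.lower() in ua for name in BLOCKED_AGENT_SUBSTRINGS)

    def handle_request(headers: dict) -> int:
        """Toy request handler: return an HTTP status code."""
        if is_blocked_user_agent(headers.get("User-Agent", "")):
            return 403  # refuse clearly self-identified scrapers
        return 200      # serve everyone else normally

    if __name__ == "__main__":
        print(handle_request({"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}))  # 403
        print(handle_request({"User-Agent": "Mozilla/5.0 Firefox/137.0"}))             # 200

Checks like this only catch crawlers that identify themselves honestly, which is why proof-of-work challenges and tarpits have become the heavier-handed complement.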
[2]
AI crawlers cause Wikimedia Commons bandwidth demands to surge 50% | TechCrunch
The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024. The reason, the outfit wrote in a blog post Tuesday, isn't growing demand from knowledge-thirsty humans, but automated, data-hungry scrapers looking to train AI models.
"Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," the post reads.
Wikimedia Commons is a freely accessible repository of images, videos and audio files that are available under open licenses or are otherwise in the public domain.
Digging down, Wikimedia says that almost two-thirds (65%) of the most "expensive" traffic -- that is, the most resource-intensive in terms of the kind of content consumed -- was from bots. However, these bots account for just 35% of overall pageviews. The reason for this disparity, according to Wikimedia, is that frequently accessed content stays closer to the user in its cache, while less frequently accessed content is stored further away in the "core data center," which is more expensive to serve content from. This is the kind of content that bots typically go looking for.
"While human readers tend to focus on specific - often similar - topics, crawler bots tend to 'bulk read' larger numbers of pages and visit also the less popular pages," Wikimedia writes. "This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources."
The long and short of all this is that the Wikimedia Foundation's site reliability team is having to spend a lot of time and resources blocking crawlers to avert disruption for regular users -- and that's before we consider the cloud costs the Foundation faces.
In truth, this represents part of a fast-growing trend that is threatening the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore "robots.txt" files that are designed to ward off automated traffic. And "pragmatic engineer" Gergely Orosz also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.
While open source infrastructure, in particular, is in the firing line, developers are fighting back with "cleverness and vengeance," as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too -- Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down. However, it's very much a cat-and-mouse game that could ultimately force many publishers to duck for cover behind logins and paywalls -- to the detriment of everyone who uses the web today.
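The cost disparity Wikimedia describes is easy to reproduce in a toy model. The Python sketch below -- an illustration, not anything resembling Wikimedia's real caching stack -- replays two synthetic request streams through a small LRU cache: a "human-like" stream skewed toward popular pages and a "bot-like" stream that reads the catalog roughly uniformly. Every number in it is an arbitrary assumption, but the skewed stream hits the cache far more often, which is exactly why bulk reads end up forwarded to the core data center.

    # Toy model: why bulk-reading bots miss caches that work well for human traffic.
    # All numbers (catalog size, cache size, Zipf exponent) are illustrative assumptions.
    import random
    from collections import OrderedDict

    CATALOG = 100_000    # distinct pages
    CACHE_SIZE = 5_000   # pages the "edge cache" can hold

    def lru_hit_rate(requests):
        """Replay a request stream through a simple LRU cache and return the hit rate."""
        cache = OrderedDict()
        hits = 0
        for page in requests:
            if page in cache:
                hits += 1
                cache.move_to_end(page)
            else:
                cache[page] = True
                if len(cache) > CACHE_SIZE:
                    cache.popitem(last=False)  # evict least recently used
        return hits / len(requests)

    def human_requests(n, skew=1.1):
        """Humans mostly revisit popular pages: Zipf-like popularity."""
        weights = [1 / (rank ** skew) for rank in range(1, CATALOG + 1)]
        return random.choices(range(CATALOG), weights=weights, k=n)

    def bot_requests(n):
        """A bulk scraper touches pages roughly uniformly, popular or not."""
        return [random.randrange(CATALOG) for _ in range(n)]

    if __name__ == "__main__":
        random.seed(0)
        print(f"human-like hit rate: {lru_hit_rate(human_requests(200_000)):.0%}")
        print(f"bot-like hit rate:   {lru_hit_rate(bot_requests(200_000)):.0%}")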
[3]
AI data scrapers are an existential threat to Wikipedia
Wikipedia is one of the greatest knowledge resources ever assembled, containing crowdsourced contributions from millions of humans worldwide - and it faces a growing threat from artificial intelligence developers. The non-profit Wikimedia Foundation, which operates Wikipedia, says since January 2024 it has seen a 50 per cent increase in network traffic requesting image and video downloads from its catalogue. That surge mostly comes from automated data scraper programs, which developers use to collect training data for their AI models....
[4]
Wikipedia Faces Flood of AI Bots That Are Eating Bandwidth, Raising Costs
Wikipedia is paying the price for the AI boom: The online encyclopedia is grappling with rising costs from bots scraping its articles to train AI models, which is straining the site's bandwidth.
On Tuesday, the nonprofit that hosts Wikipedia warned that "automated requests for our content have grown exponentially." This can disrupt access to the site, forcing the encyclopedia to add more capacity and increasing Wikipedia's data center bill.
"Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," the Wikimedia Foundation says. The Foundation noted, for example, that "since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%." However, the traffic isn't coming from human readers but from automated programs constantly downloading "openly licensed images to feed images to AI models," the nonprofit says.
Another problem is that bots often gather data from less popular Wikipedia articles. "When we took a closer look, we found out that at least 65% of this resource-consuming traffic we get for the website is coming from bots, a disproportionate amount given the overall pageviews from bots are about 35% of the total," the foundation adds. The bots will even scrape "key systems in our developer infrastructure, such as our code review platform or our bug tracker," putting a further strain on the site's resources, the nonprofit says.
In response, Wikipedia's site managers have imposed "case-by-case" rate limiting on the offending AI crawlers, or even banned them. But to address the problem over the long term, the Wikimedia Foundation is developing a "Responsible Use of Infrastructure" plan, which notes the network strain from AI bot scrapers is "unsustainable." The foundation plans to gather feedback from the Wikipedia community on the best ways to identify traffic from AI bot scrapers and filter their access. This includes requiring bot operators to go through authentication for high-volume scraping and API use. "Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance," the Wikimedia Foundation added.
Reddit faced a similar conundrum in 2023. Microsoft, for example, didn't notify Reddit that it was scraping Reddit's content and using it for its AI features. Reddit later blocked Microsoft from scraping its site, an effort Reddit CEO Steve Huffman called "a real pain in the ass."
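Case-by-case rate limiting of this kind is typically some variant of a token bucket tracked per client, keyed by IP range, declared crawler name, or API credential. The Python sketch below is a generic illustration of that idea, not the Wikimedia Foundation's actual mechanism; the rates, burst sizes, and client key are invented for the example.

    # Generic token-bucket rate limiter sketch; limits and keys are illustrative only.
    import time
    from dataclasses import dataclass, field

    @dataclass
    class TokenBucket:
        rate: float          # tokens refilled per second
        capacity: float      # maximum burst size
        tokens: float = 0.0
        last_refill: float = field(default_factory=time.monotonic)

        def allow(self) -> bool:
            """Refill based on elapsed time, then spend one token if available."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # One bucket per client key (e.g. an IP block or a declared crawler name).
    buckets: dict[str, TokenBucket] = {}

    def allow_request(client_key: str, rate: float = 5.0, burst: float = 20.0) -> bool:
        bucket = buckets.setdefault(client_key, TokenBucket(rate=rate, capacity=burst, tokens=burst))
        return bucket.allow()

    if __name__ == "__main__":
        allowed = sum(allow_request("203.0.113.0/24") for _ in range(100))
        print(f"{allowed} of 100 rapid-fire requests allowed")  # roughly the burst size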
[5]
Wikimedia Foundation bemoans AI bot bandwidth burden
Crawlers snarfing long-tail content for training and whatnot cost us a fortune
Web-scraping bots have become an unsupportable burden for the Wikimedia community due to their insatiable appetite for online content to train AI models. Representatives from the Wikimedia Foundation, which oversees Wikipedia and similar community-based projects, say that since January 2024, the bandwidth spent serving requests for multimedia files has increased by 50 percent.
"This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models," explained Birgit Mueller, Chris Danis, and Giuseppe Lavagetto of the Wikimedia Foundation in a public post. "Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs."
According to the Wikimedians, at least 65 percent of the traffic for the most expensive content served by Wikimedia Foundation datacenters is generated by bots, even though these software agents represent only about 35 percent of page views. That's due to the Wikimedia Foundation's caching scheme, which distributes popular content to regional data centers around the globe for better performance. Bots visit pages without respect to their popularity, and their requests for less popular content mean that material has to be fetched from the core data center, which consumes more computing resources.
The heedlessness of ill-behaved bots has been a common complaint over the past year or so among those operating computing infrastructure for open source projects, as the Wikimedians themselves noted by pointing to our recent report on the matter. Last month, Sourcehut, a Git-hosting service, called out overly demanding web crawlers that snarf content for AI companies. Diaspora developer Dennis Schubert, repair site iFixit, and ReadTheDocs have also objected to aggressive AI crawlers, among others.
Most websites recognize the need to provide bandwidth to serve bot inquiries as a cost of doing business, because these scripted visits help make online content easier to discover by indexing it for search engines. But since ChatGPT came online and generative AI took off, bots have become more willing to strip-mine entire websites for content that's used to train AI models. And these models may end up as commercial competitors, offering the aggregate knowledge they've gathered for a subscription fee or for free. Either scenario has the potential to reduce the need for the source website or for search queries that generate online ad revenue.
The Wikimedia Foundation, in the Responsible Use of Infrastructure section of its 2025/2026 annual planning document, cites a goal to "reduce the amount of traffic generated by scrapers by 20 percent when measured in terms of request rate, and by 30 percent in terms of bandwidth."
Noting that Wikipedia and its multimedia repository Wikimedia Commons are invaluable for training machine learning models, the planning document says "we have to prioritize who we serve with those resources, and we want to favour human consumption, and prioritize supporting the Wikimedia projects and contributors with our scarce resources."
How that's to be achieved, beyond the targeted interventions already undertaken by site reliability engineers to block the most egregious bots, is left to the imagination.
As concern about abusive AI content harvesting has been an issue for some time, quite a few tools have emerged to thwart aggressive crawlers. These include data poisoning projects such as Glaze, Nightshade, and ArtShield, as well as network-based tools including Kudurru, Nepenthes, AI Labyrinth, and Anubis.
Last year, when word of the web's discontent with AI crawlers reached the major patrons of AI bots -- Google, OpenAI, and Anthropic, among others -- there was some effort to provide methods to prevent AI crawlers from visiting websites through the application of robots.txt directives. But these instructions, stored at the root of websites so they can be read by arriving web crawlers, are not universally deployed or respected. Nor can this optional, declarative defense keep up with crawlers that simply change their names: unless an entry uses a wildcard to cover every possibility, renaming a bot is all it takes to evade a blocklist. A common claim among those operating websites is that misbehaving bots misidentify themselves as Googlebot or some other widely tolerated crawler so they don't get blocked.
Wikipedia.org, for example, doesn't bother to block AI crawlers from Google, OpenAI, or Anthropic in its robots.txt file. It blocks a number of bots deemed troublesome for their penchant for slurping whole sites, but has failed to include entries for major commercial AI firms. The Register has asked the Wikimedia Foundation why it hasn't banned AI crawlers more comprehensively.
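For context, robots.txt is a voluntary protocol: a compliant crawler downloads the file and checks each URL against it before fetching, roughly as in the Python sketch below (using the standard library's urllib.robotparser). The crawler name is a hypothetical placeholder, and nothing in the protocol forces a scraper to run this check at all, which is precisely the weakness described above.

    # Minimal sketch of a crawler that honors robots.txt before fetching a page.
    # The user-agent name is a placeholder; compliance with robots.txt is voluntary.
    from urllib import robotparser, request

    USER_AGENT = "ExampleResearchBot/0.1"   # hypothetical crawler name
    SITE = "https://en.wikipedia.org"

    def polite_fetch(path: str):
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{SITE}/robots.txt")
        rp.read()                            # download and parse the site's robots.txt
        url = f"{SITE}{path}"
        if not rp.can_fetch(USER_AGENT, url):
            print(f"robots.txt disallows {url} for {USER_AGENT}; skipping")
            return None
        req = request.Request(url, headers={"User-Agent": USER_AGENT})  # identify honestly
        with request.urlopen(req) as resp:
            return resp.read()

    if __name__ == "__main__":
        polite_fetch("/wiki/Special:Random")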
[6]
Wikipedia is struggling with voracious AI bot crawlers
The Wikimedia Foundation is getting pummeled by crawlers, which could cause issues for actual readers. Wikimedia has seen a 50 percent increase in bandwidth used for downloading multimedia content since January 2024, the foundation said in an update. But it's not because human readers have suddenly developed a voracious appetite for consuming Wikipedia articles and for watching videos or downloading files from Wikimedia Commons. No, the spike in usage came from AI crawlers, or automated programs scraping Wikimedia's openly licensed images, videos, articles and other files to train generative artificial intelligence models.
This sudden increase in traffic from bots could slow down access to Wikimedia's pages and assets, especially during high-interest events. When Jimmy Carter died in December, for instance, people's heightened interest in the video of his presidential debate with Ronald Reagan caused slow page load times for some users. Wikimedia is equipped to sustain traffic spikes from human readers during such events, and users watching Carter's video shouldn't have caused any issues. But "the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," Wikimedia said.
The foundation explained that human readers tend to look up specific and often similar topics. For instance, a number of people look up the same thing when it's trending. Wikimedia creates a cache of a piece of content requested multiple times in the data center closest to the user, enabling it to serve up content faster. But articles and content that haven't been accessed in a while have to be served from the core data center, which consumes more resources and, hence, costs more money for Wikimedia. Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.
Wikimedia said that upon a closer look, 65 percent of the resource-consuming traffic it gets is from bots. It's already causing constant disruption for its Site Reliability team, which has to block the crawlers all the time before they significantly slow down page access for actual readers.
Now, the real problem, as Wikimedia states, is that the "expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement." A foundation that relies on people's donations to continue running needs to attract new users and get them to care for its cause. "Our content is free, our infrastructure is not," the foundation said.
Wikimedia is now looking to establish sustainable ways for developers and reusers to access its content in the upcoming fiscal year. It has to, because it sees no sign of AI-related traffic slowing down anytime soon.
[7]
Wikipedia servers are struggling under pressure from AI scraping bots
Editor's take: AI bots have recently become the scourge of websites dealing with written content or other media types. From Wikipedia to the humble personal blog, no one is safe from the network sledgehammer wielded by OpenAI and other tech giants in search of fresh content to feed their AI models.
The Wikimedia Foundation, the nonprofit organization hosting Wikipedia and other widely popular websites, is raising concerns about AI scraper bots and their impact on the foundation's internet bandwidth. Demand for content hosted on Wikimedia servers has grown significantly since the beginning of 2024, with AI companies actively consuming an overwhelming amount of traffic to train their products.
Wikimedia projects, which include some of the largest collections of knowledge and freely accessible media on the internet, are used by billions of people worldwide. Wikimedia Commons alone hosts 144 million images, videos, and other files shared under a public domain license, and it is especially suffering from the unregulated crawling activity of AI bots.
The Wikimedia Foundation has experienced a 50 percent increase in bandwidth used for multimedia downloads since January 2024, with traffic predominantly coming from bots. Automated programs are scraping the Wikimedia Commons image catalog to feed the content to AI models, the foundation states, and the infrastructure isn't built to endure this type of parasitic internet traffic.
Wikimedia's team had clear evidence of the effects of AI scraping in December 2024, when former US President Jimmy Carter passed away, and millions of viewers accessed his page on the English edition of Wikipedia. The 2.8 million people reading the president's bio and accomplishments were 'manageable,' the team said, but many users were also streaming the 1.5-hour-long video of Carter's 1980 debate with Ronald Reagan. As a result of the doubling of normal network traffic, a small number of Wikipedia's connection routes to the internet were congested for around an hour. Wikimedia's Site Reliability team was able to reroute traffic and restore access, but the network hiccup shouldn't have happened in the first place.
By examining the bandwidth issue during a system migration, Wikimedia found that at least 65 percent of the most resource-intensive traffic came from bots, passing through the cache infrastructure and directly impacting Wikimedia's 'core' data center.
The organization is working to address this new kind of network challenge, which is now affecting the entire internet, as AI and tech companies are actively scraping every ounce of human-made content they can find. "Delivering trustworthy content also means supporting a 'knowledge as a service' model, where we acknowledge that the whole internet draws on Wikimedia content," the organization said. Wikimedia is promoting a more responsible approach to infrastructure access through better coordination with AI developers. Dedicated APIs could ease the bandwidth burden, making identification and the fight against "bad actors" in the AI industry easier.
The Wikimedia Foundation reports a 50% increase in bandwidth consumption due to AI bots scraping content, causing technical and financial strain on its infrastructure.
The Wikimedia Foundation, the organization behind Wikipedia and other crowdsourced knowledge projects, has reported a significant increase in bandwidth consumption. Since January 2024, the foundation has experienced a 50% surge in bandwidth usage for multimedia downloads from Wikimedia Commons [1]. This surge is primarily attributed to automated bots scraping content for AI model training, rather than increased human traffic.
The foundation's infrastructure, designed to handle sudden spikes in human traffic during high-interest events, is struggling to cope with the unprecedented volume of bot-generated traffic. Wikimedia's internal data reveals that bots account for 65% of the most expensive requests to its core infrastructure, despite making up only 35% of total pageviews [2].
This asymmetry in resource consumption is due to the nature of bot behavior. Unlike human users who tend to access popular and frequently cached content, bots indiscriminately crawl obscure and less-accessed pages. This forces Wikimedia's core datacenters to serve content directly, bypassing caching systems designed for predictable human browsing patterns [1].
The situation is further complicated by the sophisticated tactics employed by some AI-focused crawlers. Many of these bots ignore robots.txt directives, spoof browser user agents to appear as human visitors, and rotate through residential IP addresses to avoid blocking [1]. This cat-and-mouse game has forced Wikimedia's Site Reliability team into a perpetual state of defense, diverting resources from supporting contributors, users, and technical improvements.
This issue is not unique to Wikimedia. Similar challenges are being faced across the open-source community and the broader internet. Other platforms like Fedora's Pagure repository, GNOME's GitLab instance, and Read the Docs have implemented various measures to combat excessive bot access and reduce bandwidth costs [1].
In response to these challenges, the Wikimedia Foundation is developing a plan under a new initiative, WE5: Responsible Use of Infrastructure, which aims to identify and filter traffic from AI scrapers, potentially requiring authentication for high-volume scraping and API use [4]. The initiative raises critical questions about guiding developers toward less resource-intensive access methods and establishing sustainable boundaries while preserving openness [1].
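One concrete form of "less resource-intensive access" is pulling Wikimedia's published bulk dumps instead of crawling articles one by one. The Python sketch below illustrates that idea only; the dump filename is an example placeholder (current paths are listed at dumps.wikimedia.org), and a self-identifying User-Agent with a contact address is assumed as good practice rather than a stated Wikimedia requirement.

    # Sketch: prefer published bulk dumps over page-by-page crawling.
    # The dump filename is a hypothetical placeholder; real paths are listed at
    # https://dumps.wikimedia.org/ and change with each dump run.
    import urllib.request

    DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz"  # example path
    USER_AGENT = "ExampleResearchBot/0.1 (contact@example.org)"  # identify yourself and a contact

    def download_dump(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
        """Stream a dump file to disk in 1 MiB chunks instead of hammering article URLs."""
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)

    if __name__ == "__main__":
        download_dump(DUMP_URL, "enwiki-abstract.xml.gz")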
The challenge lies in bridging the gap between open knowledge repositories and commercial AI development. Many companies rely on open knowledge to train commercial models but don't contribute to the infrastructure making that knowledge accessible. This creates a technical imbalance that threatens the sustainability of community-run platforms [1].
As the Wikimedia Foundation aptly states, "Our content is free, our infrastructure is not" [5]. This situation calls for better coordination between AI developers and resource providers, potentially through dedicated APIs, shared infrastructure funding, or more efficient access patterns. Without such practical collaboration, the very platforms that have enabled AI advancement may struggle to maintain reliable service.
Wikipedia's volunteer editors form WikiProject AI Cleanup to combat the rising tide of AI-generated content, aiming to protect the integrity of the world's largest online encyclopedia.
4 Sources
Wikipedia announces a three-year AI strategy focused on supporting its volunteer community rather than replacing human editors. The plan aims to streamline workflows, improve content quality, and maintain human-centered decision-making.
5 Sources
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources
Freelancer.com's CEO Matt Barrie alleges that AI company Anthropic engaged in unauthorized data scraping from their platform. The accusation raises questions about data ethics and the practices of AI companies in training their models.
2 Sources
Companies are increasingly blocking AI web crawlers due to performance issues, security threats, and content guideline violations. These new AI-powered bots are more aggressive and intelligent than traditional search engine crawlers, raising concerns about data scraping practices and their impact on websites.
2 Sources