Curated by THEOUTPOST
On Sun, 15 Dec, 8:01 AM UTC
2 Sources
[1]
Companies alert as along come AI web spiders
AI crawlers are computer programs that collect data from websites to train large language models. Enterprises are increasingly blocking AI web crawlers due to performance issues, security threats, and violation of content guidelines. Unlike traditional crawlers, AI bots scrape high-quality data for training models, often ignoring ethical practices.

Enterprises are increasingly resorting to blocking artificial intelligence (AI) web crawlers and spiders, which are scraping the web bit by bit and hampering the performance of websites, according to industry executives and experts. AI crawlers are computer programs that gather data from websites to train large language models. With the increased use of AI search and the need to collect training data, the internet is seeing many new web scrapers such as Bytespider, PerplexityBot, ClaudeBot and GPTBot.

Until 2022, the internet had mostly conventional search engine crawlers such as GoogleBot, AppleBot and BingBot, which had obeyed the principles of ethical content scraping and scheduling for decades. The aggressive AI bots, by contrast, are not only violating content guidelines but also degrading the performance of websites, adding overhead costs and posing security threats. Many websites and content portals are implementing anti-scraping measures or bot-restriction technologies to counter this. According to Cloudflare, a leading content delivery network provider, nearly 40% of the top 10 internet domains accessed by 80% of AI bots are moving to block AI crawlers.

India's apex technology body Nasscom said these crawlers are especially damaging to news publishers if they use authored content without attribution. "Whether the use of copyrighted data for AI model training qualifies as fair use is moot," Raj Shekhar, responsible AI lead at Nasscom, told ET. "The legal dispute between ANI Media and OpenAI is a wake-up call for AI developers to heed IP (intellectual property) laws when collecting training data. Developers, therefore, must exercise caution and consult IP experts to ensure compliant data practices and avoid potential liabilities."

Reuben Koh, director of security technology and strategy at content delivery network company Akamai Technologies, said, "Scraping poses a significant overhead and impacts the performance of a website. It does this by intensively interacting with the site, attempting to scrape every single piece of content. This results in a performance penalty."

According to Cloudflare's analysis of the top 10,000 internet domains, three AI bots had the highest share of websites accessed - Bytespider, operated by TikTok's Chinese parent ByteDance (40.40%), GPTBot, operated by OpenAI (35.46%), and ClaudeBot, run by Anthropic (11.17%). Although these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them, it said. Meanwhile, there is CCBot, developed by Common Crawl, which scrapes the web to create an open-source dataset that anyone can use.

What sets AI crawlers apart

AI crawlers are different from conventional crawlers - they target high-quality text, images and videos that can enhance training datasets. AI-powered crawlers are more intelligent than conventional search engine crawlers, "which just crawl, gather data, and stop there", said Akamai's Koh. "Their intelligence is not only used for data selection but also for data classification and prioritisation. This means that even after they crawl, index and scrape all the data, they can process what the data is going to be used for," he said.
Traditionally, web scraper bots have followed the robots.txt protocol as a guiding principle on what can be indexed. Traditional search engine bots such as GoogleBot and BingBot adhere to this and stay away from protected intellectual property. AI bots, however, have been found to violate the principles of robots.txt on multiple occasions.

"Google and Bing do not overwhelm websites because they follow a predictable and transparent indexing schedule. For instance, Google is clear about how often it indexes a particular domain, allowing companies to anticipate and manage the potential performance impact," Koh said. "With newer and more aggressive crawlers, like those driven by AI, the situation is less predictable. These crawlers don't necessarily operate on a fixed schedule, and their scraping activities can be much more intensive."

Koh also cautioned about a third category of crawlers that are malicious in nature and misuse data for fraud. According to Akamai's State of the Internet research, more than 40% of all internet traffic is from bots, and about 65% of that is from malicious bots.

Can't Block Them All

However, experts said eliminating AI crawlers cannot be the ultimate solution, because websites need to be discovered. Websites need to show up in commercial search engine results, be discovered and gain customers - especially if AI search is set to become the new search practice, they said.

"Enterprises are going to be concerned if we are blocking legitimate revenue-generating crawl activity or bot activity. Or are we allowing too many malicious activities to happen on our website? It's a very fine balance they need to understand," Koh said.
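For site owners who want to opt out of AI training crawls, the robots.txt protocol discussed above remains the first, voluntary line of defence. The snippet below is a minimal sketch of what such a policy can look like, using the user-agent tokens these operators have publicly documented; the exact tokens should be verified against each operator's current documentation, and, as the article notes, the directives only restrain crawlers that choose to honour the protocol.

```
# robots.txt - ask AI-training crawlers to stay away, keep search crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Conventional search engine crawlers remain free to index the site
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```

Crawlers that ignore these directives have to be dealt with separately, through rate limiting or bot-management tooling at the network edge - the gap that providers such as Cloudflare and Akamai aim to fill.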
[2]
Cos Alert as Along Come AI Web Spiders
Companies are increasingly blocking AI web crawlers due to performance issues, security threats, and content guideline violations. These new AI-powered bots are more aggressive and intelligent than traditional search engine crawlers, raising concerns about data scraping practices and their impact on websites.
In recent years, the internet has witnessed a surge in AI-powered web crawlers, presenting new challenges for companies and content providers. Unlike traditional search engine crawlers such as GoogleBot and BingBot, these AI bots are designed to collect high-quality data for training large language models. Popular AI crawlers include Bytespider, PerplexityBot, ClaudeBot, and GPTBot [1][2].
AI crawlers are more aggressive in their data collection methods, often violating content guidelines and degrading website performance. This has led to increased overhead costs and potential security threats for many websites. According to Cloudflare, a leading content delivery network provider, nearly 40% of the top 10 internet domains accessed by 80% of AI bots are now moving to block these crawlers [1].
Reuben Koh, director of security technology and strategy at Akamai Technologies, explains that AI scraping poses significant overhead and impacts website performance. These bots intensively interact with sites, attempting to scrape every piece of content, resulting in performance penalties [1][2].
AI-powered crawlers differ from conventional ones in several ways:
- They target high-quality text, images and videos that can enhance training datasets, rather than simply indexing pages.
- Their intelligence is used not only for data selection but also for classification and prioritisation, so they can process what the scraped data will be used for.
- They do not necessarily operate on a fixed, transparent schedule, and their scraping activity can be far more intensive and less predictable.
- They have been found to ignore robots.txt directives and content guidelines that traditional search crawlers respect.
The aggressive nature of AI crawlers has raised ethical and legal concerns, particularly regarding intellectual property rights. Nasscom, India's apex technology body, warns that these crawlers can be especially damaging to news publishers if they use authored content without attribution. The ongoing legal dispute between ANI Media and OpenAI serves as a wake-up call for AI developers to respect IP laws when collecting training data [1][2].
Cloudflare's analysis of the top 10,000 internet domains reveals that three AI bots had the highest share of websites accessed:
- Bytespider, operated by TikTok's Chinese parent ByteDance: 40.40%
- GPTBot, operated by OpenAI: 35.46%
- ClaudeBot, run by Anthropic: 11.17%
Although these bots follow the rules, Cloudflare says its customers overwhelmingly opt to block them [1][2].
While many websites are implementing anti-scraping measures, experts caution that completely eliminating AI crawlers may not be the ultimate solution. Websites need to be discoverable, especially if AI search becomes the new standard for internet searches. Companies must strike a balance between blocking malicious activities and allowing legitimate crawling that can generate revenue [1][2].
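One common, if crude, way of striking that balance is to differentiate crawlers by their declared user agent in application middleware or at the edge. The sketch below is a hypothetical, simplified Python WSGI middleware - not any vendor's product - that refuses requests from the AI-training bot names mentioned in this article while letting search crawlers and ordinary visitors through; in practice user-agent strings can be spoofed, so sites typically layer rate limiting and CDN bot-management signals on top.

```python
# Hypothetical sketch: block self-identified AI-training crawlers by User-Agent.
# Simplification only - user agents can be spoofed, so real deployments combine
# this with rate limiting, IP verification and CDN-level bot management.

AI_TRAINING_BOTS = ("gptbot", "claudebot", "perplexitybot", "bytespider", "ccbot")


class BotFilterMiddleware:
    """WSGI middleware that returns 403 to known AI-training crawlers."""

    def __init__(self, app):
        self.app = app  # the wrapped WSGI application

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in user_agent for bot in AI_TRAINING_BOTS):
            # Refuse AI-training crawlers outright.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access for AI training is not permitted.\n"]
        # Search engine crawlers and ordinary visitors are served normally.
        return self.app(environ, start_response)


if __name__ == "__main__":
    # Quick local demo using the standard-library reference WSGI server.
    from wsgiref.simple_server import make_server

    def site(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, human or well-behaved crawler.\n"]

    make_server("127.0.0.1", 8000, BotFilterMiddleware(site)).serve_forever()
```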
Akamai's State of the Internet research reveals that more than 40% of all internet traffic comes from bots, with about 65% of that traffic originating from malicious bots. This highlights the complex landscape that website owners and content providers must navigate in the age of AI [1][2].
As the AI crawler ecosystem continues to evolve, companies and content providers will need to adapt their strategies to protect their assets while remaining discoverable in an increasingly AI-driven online environment.
References
[1] Companies alert as along come AI web spiders
[2] Cos Alert as Along Come AI Web Spiders