2 Sources
[1]
Protection from AI crawlers eludes visual artists despite available tools, study shows
Visual artists want to protect their work from non-consensual use by generative AI tools such as ChatGPT. But most of them lack the technical know-how, or control over the necessary tools, to do so.

One of the best ways to protect artists' creative work is to prevent it from ever being seen by "AI crawlers" -- the programs that harvest data on the internet for training generative models. But most artists don't have access to the tools that would allow them to block crawlers, and when they do have access, they don't know how to use them.

These are some of the conclusions of a study by researchers at the University of California San Diego and the University of Chicago, which will be presented at the 2025 Internet Measurement Conference in October in Madison, Wis. The study is published on the arXiv preprint server.

"At the core of the conflict in this paper is the notion that content creators now wish to control how their content is used, not simply if it is accessible. While such rights are typically explicit in copyright law, they are not readily expressible, let alone enforceable, in today's internet. Instead, a series of ad hoc controls have emerged based on repurposing existing web norms and firewall capabilities, none of which match the specificity, usability, or level of enforcement that is, in fact, desired by content creators," the researchers write.

The research team surveyed over 200 visual artists about the demand for tools to block AI crawlers, as well as the artists' technical expertise. The researchers also reviewed more than 1,100 professional artist websites to see how much control artists had over AI-blocking tools. Finally, the team evaluated which approaches were most effective at blocking AI crawlers.

Currently, artists can fairly easily use tools that mask original artworks from AI crawlers by subtly transforming the art into something different. The study's co-authors at the University of Chicago developed one such tool, known as Glaze. But ideally, artists would be able to keep AI crawlers from harvesting their data altogether. To do so, visual artists need to defend themselves against three categories of AI crawlers: one harvests data to train the large language models that power chatbots, another feeds the knowledge of AI-backed assistants, and a third supports AI-backed search engines.

Artist survey

There has been extensive media coverage of how generative AI has severely disrupted the livelihoods of many artists. Accordingly, close to 80% of the 203 visual artists the researchers surveyed said they have taken proactive steps to keep their artwork out of the training data of generative AI tools, and two-thirds reported using Glaze. In addition, 60% of artists have cut back on the amount of work they share online, and 51% share only low-resolution images of their work. Fully 96% of artists said they would like access to a tool that can deter AI crawlers from harvesting their data, yet more than 60% were not familiar with one of the simplest tools that can do this: robots.txt.

Tools for deterring AI crawlers

Robots.txt is a simple text file placed in the root directory of a website that spells out which pages crawlers are allowed to access, and which crawlers are not allowed to access the website at all. Crawlers, however, are under no obligation to follow these restrictions.
The researchers surveyed the 100,000 most popular websites on the internet and found that more than 10% explicitly disallow AI crawlers in their robots.txt files. Some sites, however, including Vox Media and The Atlantic, removed this prohibition after entering into licensing agreements with AI companies. Indeed, the number of sites allowing AI crawlers is increasing, including popular right-wing misinformation sites; the researchers hypothesize that these sites may be seeking to spread misinformation to LLMs.

One issue for artists is that they often have no access to, or control over, the relevant robots.txt file. In a survey of 1,100 artist websites, the researchers found that more than three-quarters are hosted on third-party service platforms, most of which do not allow modifications to robots.txt. Many of the content management systems artists use also give them little to no information about what type of crawling is blocked. Squarespace is the only company that provides a simple interface for blocking AI tools, but the researchers found that only 17% of artists who use Squarespace enable this option, perhaps because many are unaware the setting exists.

But do crawlers respect the prohibitions listed in robots.txt, even though they are not mandatory? The answer is mixed. Crawlers from big corporations generally do respect robots.txt, both in claim and in practice. The only crawler the researchers could clearly determine does not is Bytespider, deployed by TikTok owner ByteDance. A large number of other crawlers claim to respect robots.txt restrictions, but the researchers were unable to verify that they actually do. All in all, "the majority of AI crawlers operated by big companies do respect robots.txt, while the majority of AI assistant crawlers do not," the researchers write.

More recently, network provider Cloudflare has launched a "block AI bots" feature. At this point, only 5.7% of the sites using Cloudflare have enabled this option, but the researchers hope it will become more popular over time. "While it is an 'encouraging new option', we hope that providers become more transparent with the operation and coverage of their tools (for example, by providing the list of AI bots that are blocked)," said Elisa Luo, one of the paper's authors and a Ph.D. student in co-author Stefan Savage's research group.

Legislative and legal uncertainties

The global landscape around AI crawlers is constantly shifting amid legal changes and a wide range of legislative proposals. In the United States, AI companies face legal challenges over the extent to which copyright applies to models trained on data scraped from the internet, and over what their obligations might be to the creators of this content. In the European Union, the recently passed AI Act requires providers of AI models to get authorization from copyright holders to use their data.

"There is reason to believe that confusion around the availability of legal remedies will only further focus attention on technical access controls," the researchers write. "To the extent that any U.S. court finds an affirmative 'fair use' defense for AI model builders, this weakening of remedies on use will inevitably create an even stronger demand to enforce controls on access."
[2]
How Can Visual Artists Protect Their Work From AI Crawlers? It's Complicated | Newswise
Squarespace provides a user-friendly option for controlling whether AI-related crawlers are disallowed in a site's robots.txt.
The work was partially funded by NSF grant SaTC-2241303 and Office of Naval Research project #N00014-24-1-2669.

Paper: "Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers," by Enze Alex Liu, Elisa Luo, Geoffrey M. Voelker, and Stefan Savage (Department of Computer Science and Engineering, University of California San Diego) and Shawn Shan and Ben Y. Zhao (University of Chicago).
A study reveals that while visual artists want to protect their work from AI crawlers, they lack the technical expertise and access to necessary tools. The research highlights the complexities of digital content protection in the age of AI.
In an era where artificial intelligence is reshaping the creative landscape, visual artists find themselves in a precarious position. A recent study by researchers from the University of California San Diego and the University of Chicago has shed light on the challenges artists face in protecting their work from non-consensual use by generative AI tools [1].
The study, set to be presented at the 2025 Internet Measurement Conference, reveals a stark reality: nearly 80% of the 203 visual artists surveyed have attempted to take proactive measures to prevent their artwork from being included in AI training data. An overwhelming 96% expressed a desire for tools that can deter AI crawlers from harvesting their data [2].
One of the simplest tools available is the robots.txt file, which can specify which crawlers are allowed to access a website. However, more than 60% of artists surveyed were unfamiliar with this tool [1]. Moreover, the effectiveness of robots.txt is limited:

- Crawlers are under no obligation to honor its restrictions; compliance is entirely voluntary.
- More than three-quarters of the artist websites reviewed are hosted on third-party platforms, most of which do not let site owners modify robots.txt at all [1].
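For readers unfamiliar with the format, a minimal robots.txt might look like the sketch below. This is an illustrative example, not a file from the study; the user-agent tokens GPTBot (OpenAI), Google-Extended (Google), CCBot (Common Crawl), and Bytespider (ByteDance) are real AI-crawler identifiers, but the selection is far from exhaustive:

```
# Hypothetical robots.txt: disallow selected AI crawlers site-wide.
# Compliance is voluntary; a crawler can simply ignore these rules.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# All other crawlers may access everything (empty Disallow = no restriction)
User-agent: *
Disallow:
```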
Network provider Cloudflare has introduced a "block AI bots" feature, but its adoption is still low, with only 5.7% of Cloudflare-using sites enabling this option [1].
The study found that while most AI crawlers operated by large companies respect robots.txt restrictions, the majority of AI assistant crawlers do not. Notably, Bytespider, deployed by TikTok owner ByteDance, was identified as a crawler that does not respect these restrictions [2].
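To make "respecting robots.txt" concrete, here is a minimal sketch of how a well-behaved crawler is expected to behave, using Python's standard urllib.robotparser module. The URLs are placeholders; nothing in the protocol enforces this check, which is why a non-compliant crawler can simply skip it:

```python
# Minimal sketch of a well-behaved crawler's robots.txt check.
# The URLs are placeholders; "GPTBot" and "Bytespider" are real
# AI-crawler user-agent tokens, used here only for illustration.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

page = "https://example.com/portfolio/"
for agent in ("GPTBot", "Bytespider"):
    if rp.can_fetch(agent, page):
        print(f"{agent}: robots.txt permits fetching {page}")
    else:
        # A compliant crawler stops here; a non-compliant one fetches anyway.
        print(f"{agent}: robots.txt disallows {page}")
```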
In response to the threat of AI harvesting, artists have adopted various strategies:

- Two-thirds of those surveyed use Glaze, a tool that subtly alters an artwork so that AI models see something different from what human viewers see.
- 60% have cut back on the amount of work they share online.
- 51% share only low-resolution images of their work [1].
The research team's review of over 1,100 professional artist websites revealed that most are hosted on third-party service platforms. Squarespace stands out as the only company providing a simple interface for blocking AI tools, yet only 17% of artists using Squarespace enable this option [2].
The study highlights a fundamental shift in content creators' needs. As the researchers note, "content creators now wish to control how their content is used, not simply if it is accessible" [1]. This desire for control extends beyond mere accessibility, touching on issues of copyright and digital rights management in an AI-driven world.
As the landscape of AI and creative work continues to evolve, the need for more sophisticated, user-friendly, and effective tools to protect artists' work becomes increasingly apparent. The research underscores the urgency for developers, platforms, and policymakers to address these concerns and provide artists with the means to safeguard their creative output in the digital age.