Curated by THEOUTPOST
On Mon, 29 Jul, 8:00 AM UTC
2 Sources
[1]
Anthropic Conducted 'Egregious' Data Scraping, Freelancer.com CEO Says
Such actions could violate publishers' terms of service, the Financial Times (FT) reported Friday (July 26), citing interviews with those website owners. Data scraping is the automated process of pulling information from websites or other digital sources, often without the express permission of the content owners. Companies that generate content have a vested interest in safeguarding that material to maintain revenues.
As the FT noted, Anthropic was founded by former OpenAI researchers who wanted to develop "responsible" AI systems. But Matt Barrie, CEO of Freelancer.com, accused the startup of being "the most aggressive scraper by far" of his freelance work portal, which gets millions of visits per day. The site got 3.5 million visits from an Anthropic-linked web "crawler" in the span of four hours, according to data shared with the FT. That makes Anthropic "probably about five times the volume of the number two" AI crawler, Barrie said.
"We had to block them because they don't obey the rules of the internet," Barrie said. "This is egregious scraping [which] makes the site slower for everyone operating on it and ultimately affects our revenue."
According to the report, other web publishers say Anthropic is swarming their sites and ignoring their requests to cease collecting their content to train its models. PYMNTS has contacted Anthropic for comment but has not yet received a reply. The company told the FT it was looking into the Freelancer.com case and that it endeavored not to be "intrusive or disruptive."
The dispute is happening at a moment when, as PYMNTS wrote earlier this month, "businesses are grappling with the unauthorized harvesting of their online content, prompting new defensive measures that could reshape the digital landscape." For example, the web infrastructure company Cloudflare recently unveiled a new tool against content scraping that could derail major AI companies' training operations. "The software is designed to prevent automated data collection and has the potential to reshape how AI models are developed and trained," that report said. And as companies scramble to safeguard their digital assets, industry experts predict a spike in demand for similar protective measures, potentially bringing about a new market for anti-AI scraping services.
When a business's "information is scraped, especially in near real time, it can be summarized and posted by an AI over which they have no control, which in turn deprives the content creator of getting its own clicks -- and the attendant revenue," HP Newquist, executive director of The Relayer Group and author of "The Brain Makers," told PYMNTS.
[2]
Freelancer boss accuses AI giant of 'egregious scraping'
Other web publishers have echoed Mr Barrie's concerns that Anthropic is swarming their sites and ignoring their instructions to stop collecting their content to train its models.
Freelancer.com received 3.5 million visits from an Anthropic-linked web "crawler" in the space of four hours, according to data shared with the Financial Times. That makes Anthropic "probably about five times the volume of the number two" AI crawler, Mr Barrie said. Visits from its bot continued to increase even after Freelancer.com attempted to refuse its access requests, using standard web protocols for guiding crawlers, he added. After that, Mr Barrie decided to block traffic from Anthropic's internet addresses altogether.
"We had to block them because they don't obey the rules of the internet," Mr Barrie said. "This is egregious scraping [which] makes the site slower for everyone operating on it and ultimately affects our revenue."
Anthropic said it was investigating the case and that it respected publishers' requests and aimed not to be "intrusive or disruptive".
Scraping publicly available data from across the web is generally legal. But the practice is contentious, can breach websites' terms of service and can be costly for site hosts.
Kyle Wiens, chief executive of iFixit.com, said his electronic repairs site received 1 million hits from Anthropic bots within 24 hours. "We have a load of alarms [for high traffic], people get woken up at 3am. This set off every alarm we have," he said. iFixit's terms of service prohibited the use of its data for machine learning, Mr Wiens said. "My first message to Anthropic is: if you're using this to train your model, that's illegal. My second is: this is not polite internet behaviour. Crawling is an etiquette thing."
Websites use a protocol known as 'robots.txt' to try to keep crawlers and other web robots off portions of their sites. However, it relies on voluntary compliance.
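To make the "voluntary compliance" point concrete, here is a minimal sketch of the check a well-behaved crawler would run before requesting a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name `ExampleAIBot`, and the `example.com` URLs are all hypothetical and chosen only for illustration; they do not reflect any real site's file or any real crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: one named bot is shut
# out entirely, everyone else is kept off /private/ and asked to wait
# two seconds between requests.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named bot is disallowed everywhere.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/jobs/123"))  # False
# Other bots fall under the wildcard rules.
print(may_fetch(parser, "OtherBot", "https://example.com/jobs/123"))      # True
print(may_fetch(parser, "OtherBot", "https://example.com/private/x"))     # False
# The requested pause between requests, in seconds.
print(parser.crawl_delay("OtherBot"))                                     # 2
```

Nothing on the server enforces these answers. The check lives entirely in the crawler's own code, so a bot that skips it can still request every page, which is why site owners fall back on blocking IP addresses.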
"We respect robots.txt and our crawler respected that signal when iFixit implemented it," said Anthropic. The company also said its crawlers respected "anti-circumvention technologies" such as CAPTCHAs, and that "our crawling should not be intrusive or disruptive. We aim for minimal disruption by being thoughtful about how quickly we crawl the same domains".
Data scraping is not a new practice but it has ramped up dramatically in the last two years as a result of the AI arms race. That has imposed new costs on websites. "AI crawlers have cost us a significant amount of money in bandwidth charges, and caused us to spend a large amount of time dealing with abuse," wrote Eric Holscher, co-founder of document hosting website Read the Docs, in a blog post on Thursday. "AI crawlers are acting in a way that is not respectful to the sites they are crawling, and that is going to cause a backlash against AI crawlers in general."
Anthropic has created some of the world's most advanced chatbots -- rivalling OpenAI's ChatGPT -- which can respond to an array of prompts in natural language, while positioning itself as a more ethical actor than some rivals. Anthropic's stated purpose is "the responsible development and maintenance of advanced AI for the long-term benefit of humanity".
As leading AI companies compete to create ever more capable and dexterous models, they are pushing deeper into untapped corners of the web, partnering with publishers or creating synthetic training data. OpenAI has struck a number of deals in recent months with publishers and content providers including Reddit, The Atlantic and The Financial Times. Anthropic has not publicly announced similar partnerships.
"The search engines have always done a lot of scraping," Mr Barrie said, "but it's gone up a whole level with training generative AI."
iFixit's mission "is to give information away", said Mr Wiens, to encourage people to repair their own devices. "We're not opposed to them using our content to train models, we just want to be part of the conversation." He added: "I'm not a crusader on this topic, I'm just trying to keep a website online."
Freelancer.com's CEO Matt Barrie alleges that AI company Anthropic engaged in unauthorized data scraping from their platform. The accusation raises questions about data ethics and the practices of AI companies in training their models.
Matt Barrie, the CEO of Freelancer.com, has leveled serious accusations against artificial intelligence company Anthropic, claiming that the AI firm engaged in "egregious" data scraping from the Freelancer.com platform [1]. The allegations have sparked a heated debate about data ethics and the practices employed by AI companies in training their large language models.
According to Barrie, Anthropic scraped Freelancer.com at a volume far beyond that of any other AI crawler. Data shared with the Financial Times showed 3.5 million visits from an Anthropic-linked crawler in the space of four hours, which Barrie put at "probably about five times the volume of the number two" AI crawler [2]. Visits from the bot reportedly kept increasing after the site tried to refuse its requests using standard web protocols, and Freelancer.com ultimately blocked traffic from Anthropic's internet addresses altogether [2]. If the account is accurate, it has implications for both the platform's users and the AI industry at large.
Anthropic told the Financial Times that it was investigating the Freelancer.com case, that it respects publishers' requests, and that it aims not to be "intrusive or disruptive" [1][2]. The legal position is less clear-cut than the accusations suggest: scraping publicly available data is generally legal, but the practice is contentious and can breach a website's terms of service [2].
The alleged data scraping raises serious concerns for Freelancer.com and its user base. If information about freelancers and their work has been harvested without consent, it could affect the privacy and intellectual property of the professionals who use the platform. Barrie's stated complaint is more immediate: the crawler traffic "makes the site slower for everyone operating on it and ultimately affects our revenue" [2].
This incident has brought to the forefront the ongoing debate about the ethics of data collection practices in the AI industry. As companies race to develop more advanced AI models, the methods used to acquire training data have come under increased scrutiny. The allegations against Anthropic highlight the need for clearer regulations and industry standards regarding data scraping and the use of publicly available information for AI training purposes [1].
In light of these accusations, there have been renewed calls for greater transparency from AI companies about their data collection and model training practices. Industry experts and policymakers are emphasizing the need for robust regulations that protect individual privacy and intellectual property rights while still allowing for innovation in the AI field [2]. The incident may serve as a catalyst for more stringent oversight of AI development practices in the future.
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources
Companies are increasingly blocking AI web crawlers due to performance issues, security threats, and content guideline violations. These new AI-powered bots are more aggressive and intelligent than traditional search engine crawlers, raising concerns about data scraping practices and their impact on websites.
2 Sources
A group of authors has filed a lawsuit against AI company Anthropic, alleging copyright infringement in the training of their AI chatbot Claude. The case highlights growing concerns over AI's use of copyrighted material.
14 Sources
Cloudflare introduces new bot management tools allowing website owners to control AI data scraping. The tools enable blocking, charging, or setting conditions for AI bots accessing content, potentially reshaping the landscape of web data collection.
13 Sources
Apple's efforts to train its AI models using web content are meeting opposition from prominent publishers. The company's web crawler, Applebot, has been increasingly active, raising concerns about data usage and copyright issues.
3 Sources
© 2025 TheOutpost.AI All rights reserved