Curated by THEOUTPOST
On Tue, 23 Jul, 12:01 AM UTC
3 Sources
[1]
Crisis Looms as AI Companies Rapidly Lose Access to Training Data
AI companies typically build their models on huge amounts of publicly available content, from YouTube videos to newspaper articles. But many of the hosts of that content have now started putting up restrictions, and those restrictions could bring about a "crisis" that makes AI models less effective, according to a new study by MIT's Data Provenance Initiative.

The researchers audited 14,000 websites that are scraped to build prominent AI training data sets. The intriguing result: about 28 percent "of the most actively maintained, critical sources" on the internet are now "fully restricted from use." The administrators of these websites have imposed those restrictions by placing increasingly stringent limits on what web crawler bots are allowed to scrape.

"If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems," the researchers write.

It's understandable that content hosts would put restrictions on their caches of now-valuable data. AI companies have taken this publicly available material, much of it copyrighted, and are using it to make money without permission. That has understandably upset many, from The New York Times to celebrities like Sarah Silverman. What's particularly galling is that people like OpenAI CTO Mira Murati are saying some creative jobs should disappear, even though it's the content made by those same creative people that powers models like OpenAI's ChatGPT.

The arrogance on display, and the resulting blowback, have produced what the researchers call a crisis of consent: the once freewheeling internet with no walls is becoming a thing of the past, and AI models stand to become more biased, less diverse, and less fresh.

Some companies are hoping to work around these constraints by using synthetic data, which is essentially data generated by AI, but so far that has been a poor substitute for original content produced by actual human beings. Others, like OpenAI, have struck deals with media companies, but many observers have expressed alarm at these agreements, and for good reason: the goals of tech companies and media outfits are at odds.

Time will tell how the whole thing shakes out. One thing's for sure, though: stockpiles of training data are becoming more valuable, and more scarce, than ever.
[2]
Data Owners Are Increasingly Blocking AI Companies From Using Their IP
Training data for generative AI models like Midjourney and ChatGPT is beginning to dry up, according to a new study.

The world of artificial intelligence moves fast. While court cases attempt to decide whether using copyrighted text, images, and video to train AI models is "fair use," as tech companies argue, those same firms are already running out of new data to harvest. As generative AI has proliferated and become well-known, there has been a well-documented backlash, and many people, photographers among them, have responded by denying access to their online data.

An MIT-led research group conducted the study, which examined 14,000 web domains included in three major AI training data sets. The study, published by the Data Provenance Initiative, found an "emerging crisis in consent" as online publishers pull up the drawbridge by refusing permission to AI crawlers.

The researchers looked at the C4, RefinedWeb, and Dolma data sets and found that five percent of all the data is now restricted. That number jumps to 25 percent when looking at the highest-quality sources, and generative AI needs high-caliber data to produce good models.

robots.txt, a decades-old method for website owners to stop automated bots from crawling their pages, is increasingly being deployed to block tech companies from collecting data. According to The New York Times, some AI executives worry about hitting the "data wall." Essentially, data owners, such as photographers, have become distrustful of the AI industry and are making things difficult.

The AI industry has long been accused of profiteering from the work of artists, a theme at the center of a number of ongoing lawsuits, including those brought by photographers against the likes of Google, Midjourney, and Stability AI.

However, robots.txt files are not legally binding. The Times describes them as a "no trespassing" sign for data, but there is no way of actually enforcing them. OpenAI, which operates DALL-E and ChatGPT, says it respects robots.txt, as do the major search engines and Anthropic. Other players, though, have been accused of ignoring the files.

"Unsurprisingly, we're seeing blowback from data creators after the text, images, and videos they've shared online are used to develop commercial systems that sometimes directly threaten their livelihoods," says Yacine Jernite, a machine learning researcher at Hugging Face.

There is a concern, however, that if all AI training data must be obtained through licensing deals, players such as researchers and civil society groups will be excluded from participating in the technology.
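To illustrate the blocking mechanism described above, here is a minimal sketch of the kind of robots.txt a publisher might deploy. GPTBot and Google-Extended are the user-agent tokens OpenAI and Google document for their AI crawlers; the wildcard group keeps ordinary crawlers welcome:

    # robots.txt: opt out of AI training crawlers, allow everything else
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /

Again, nothing enforces these directives: a crawler that ignores the file faces no technical barrier, and the file itself is not legally binding.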
[3]
AI training data pool shrinks as sites ban creepy crawlers
Shrinks training pool, but hurts services like the Internet Archive

The internet is becoming significantly more hostile to webpage crawlers, especially those operated for the sake of generative AI, researchers say. In a study titled "Consent in Crisis," the Data Provenance Initiative looked into the domains scanned in three of the most important datasets used for training AI models.

Training data usually includes publicly available info from all sorts of websites, but giving the public access to data isn't the same as giving consent for collecting it automatically using a crawler. Crawling for data, also known as scraping, has been around much longer than generative AI, and websites already had rules on what crawlers could and couldn't do. These rules are contained in the robots.txt standard (basically an honor code for crawlers) as well as websites' terms and conditions.

The researchers examined the whole datasets, C4, Dolma, and RefinedWeb, as well as their most used domains. The data shows that websites reacted to the introduction of AI crawlers in 2023. Specifically, OpenAI's GPTBot and Google's Google-Extended crawlers immediately triggered websites to start changing their robots.txt restrictions. Today, between 20 and 33 percent of the top domains have enacted complete restrictions on crawlers, as opposed to just a few percent in early 2023. Across the whole body of domains, only 1 percent enforced restrictions prior to mid-2023; now 5-7 percent do.

Some websites are also changing their terms of service to completely ban both crawling and the use of hosted content for generative AI, though the change isn't nearly as drastic as it is with robots.txt.

When it comes to whose crawlers are getting blocked, OpenAI is by far in the lead, having been banned from 25.9 percent of top sites. Anthropic and Common Crawl have been kicked out of 13.3 percent, while crawlers from Google, Meta, and others are restricted at less than 10 percent of domains.

As for which sites are putting up barriers to AI crawlers, it's largely news sites. Among all domains, news publications were by far the most likely to have terms of service (ToS) and robots.txt settings restricting AI crawlers. For the top domains specifically, however, social media platforms and forums (think Facebook and X) were just as likely as news publications to restrict crawlers via their terms of service.

Although it's clear lots of websites don't want their content being scraped for use in AI, the Data Provenance Initiative says they're not communicating that effectively. Part of this is down to the restrictions in robots.txt and the ToS not lining up: 34.9 percent of the top training websites make it clear in the ToS that crawling isn't allowed, but fail to mirror that in robots.txt. On the other hand, websites with no ToS at all are surprisingly likely to set up partial or complete blocks on crawlers. And when crawling is banned, websites tend to ban just OpenAI, Common Crawl, and Anthropic.

The study also found some websites fail to correctly identify and restrict certain crawlers: 4.5 percent of sites banned "Anthropic-AI" and "Claude-Web" instead of Anthropic's actual crawler, ClaudeBot. Plus, there are bots for collecting training materials but also those for grabbing up-to-date info, and the distinction might not always be clear to website operators. So while GPTBot is banned on some domains, ChatGPT-User isn't, even though both are used for crawling.
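The misidentification matters because robots.txt matching is mechanical: a compliant crawler obeys only the groups whose user-agent token matches its own. As a rough illustration (the rules and URL below are invented), Python's standard urllib.robotparser shows how a rule aimed at the wrong token fails to bind the real crawler:

    from urllib.robotparser import RobotFileParser

    # A robots.txt that names the wrong token: Anthropic's crawler
    # identifies itself as "ClaudeBot", not "Anthropic-AI".
    rules = [
        "User-agent: Anthropic-AI",
        "Disallow: /",
        "",
        "User-agent: *",
        "Allow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # ClaudeBot matches only the wildcard group, so the page stays
    # fetchable despite the operator's intent to block Anthropic.
    print(rp.can_fetch("ClaudeBot", "https://example.com/article"))     # True
    print(rp.can_fetch("Anthropic-AI", "https://example.com/article"))  # False

The same mechanics apply to the GPTBot versus ChatGPT-User confusion: each token is matched independently, so blocking one says nothing about the other.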
Obviously, sites locking down their data will negatively impact AI model training, especially since the websites most likely to crack down tend to have the highest quality data. But the team points out that crawlers operated by academia and nonprofits like the Internet Archive are getting caught in the crossfire.

The study also raises the possibility that AI firms have been crawling hardest for data that doesn't match how their models are actually used. While almost 40 percent of the top domains in the three datasets were news-related, only about 1 percent of ChatGPT inquiries concerned news; over 30 percent were for creative writing, with sexual roleplay in second place, followed by requests for translation, coding assistance, and general information.

The researchers say the traditional structures of robots.txt and ToS aren't capable of accurately defining rules in the age of AI. Part of the problem is that a total ban is the easiest thing to enforce, since robots.txt is mostly useful for blocking specific crawlers rather than communicating more nuanced rules, such as what crawlers are allowed to do with collected data. Until better consent mechanisms emerge, the current trajectory of AI data scraping could reshape how the web is structured, likely leaving it less open than it was before. ®
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is shrinking the pool of available training data, potentially impacting the development of future AI models.
Artificial intelligence (AI) companies are facing an unexpected hurdle: a shrinking pool of training data. As reported by multiple sources, data owners are increasingly blocking AI firms from accessing their intellectual property (IP) for training purposes, leading to what some are calling a "data drought" [1].
The trend of data restriction is gaining momentum across various sectors. Content creators, publishers, and other IP holders are becoming more protective of their assets, recognizing the value of their data in the AI ecosystem. This shift is partly driven by concerns over copyright infringement and the potential misuse of their content in AI-generated works [2].
The consequences of this data scarcity are significant for AI companies. With less diverse and comprehensive training data available, the development of future AI models could be hampered. Experts warn that this could lead to less accurate and less capable AI systems, potentially slowing the rapid advances of recent years [3].
The situation has brought legal and ethical questions surrounding the use of data for AI training to the forefront. Some data owners argue that their content has been used without proper compensation or consent, leading to calls for more stringent regulations and fair-use policies in the AI industry [2].
In response to these challenges, AI companies are exploring alternative strategies. Some are considering partnerships with data owners, offering compensation or other incentives for access to high-quality training data. Others are investigating synthetic data generation techniques to supplement their training sets [1].
As the landscape of AI training data continues to evolve, industry observers predict a shift toward more ethical and transparent data acquisition practices. This may lead to a new era of collaboration between AI firms and content creators, potentially resulting in more balanced and fair AI development processes [3].
The ongoing "data drought" serves as a reminder of the complex interplay between technological advancement, intellectual property rights, and ethical considerations in the rapidly evolving field of artificial intelligence. As the situation unfolds, it will undoubtedly shape the future trajectory of AI development and deployment across various industries.
Reference
[1] "Crisis Looms as AI Companies Rapidly Lose Access to Training Data"
[2] "Data Owners Are Increasingly Blocking AI Companies From Using Their IP"
[3] "AI training data pool shrinks as sites ban creepy crawlers"