Curated by THEOUTPOST
On Sat, 20 Jul, 12:01 AM UTC
2 Sources
[1]
Data that powers artificial intelligence is disappearing at a rapid pace
The study, which looked at 14,000 web domains that are included in three commonly used AI training data sets, discovered an "emerging crisis in consent," as publishers and online platforms have taken steps to prevent their data from being harvested. The researchers estimate that in the three data sets -- called C4, RefinedWeb and Dolma -- 5 per cent of all data, and 25 per cent of data from the highest-quality sources, have been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
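For concreteness, the robots.txt files driving these restrictions look roughly like the sketch below. GPTBot, CCBot and Google-Extended are user-agent tokens the respective crawler operators have published; the exact directives vary from site to site.

    # robots.txt -- served at the site root, e.g. https://example.com/robots.txt
    User-agent: GPTBot           # OpenAI's training crawler
    Disallow: /

    User-agent: CCBot            # Common Crawl, whose corpus feeds datasets like C4
    Disallow: /

    User-agent: Google-Extended  # opt-out for Google's A.I. training uses
    Disallow: /

    User-agent: *                # all other bots may still crawl
    Allow: /

A crawler that honors the protocol checks these rules before fetching anything. A minimal sketch using only Python's standard library (example.com is a placeholder domain):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's rules

    # A bot identifying itself as GPTBot asks before recording a page.
    print(rp.can_fetch("GPTBot", "https://example.com/articles/1"))        # False under the rules above
    print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True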
[2]
A.I. Companies Are Running Out of Training Data: Study
In the past year, around 25 percent of data from high-quality sources has been restricted from major datasets used to train A.I. models. As the A.I. models developed by tech companies become larger, faster and more ambitious in their capabilities, they require more and more high-quality data to be trained on. Simultaneously, however, websites are beginning to crack down on the use of their text, images and videos in training A.I. -- a move that has restricted large swathes of content from datasets in what constitutes an "emerging crisis in data consent," according to a recent study published by the Data Provenance Initiative, a group led by researchers at the Massachusetts Institute of Technology (MIT).
The study found that in the past year alone, a "rapid crescendo of data restrictions from web sources," set off by concerns about the ethical and legal challenges of A.I.'s use of public data, has walled off much of the web from both commercial and academic A.I. institutions. Between April 2023 and April 2024, 5 percent of all data and 25 percent of data from the highest-quality sources were restricted, the researchers found by examining some 14,000 web domains used to assemble three major datasets known as C4, RefinedWeb and Dolma.
Major A.I. companies typically collect data through automated bots known as web crawlers, which explore the internet and record content. In the case of the C4 dataset, 45 percent of data has become restricted through website protocols preventing web crawlers from accessing content. These restrictions affect different companies' crawlers unevenly and typically advantage "less widely known A.I. developers," according to the study. OpenAI's crawlers were disallowed by nearly 26 percent of high-quality data sources, for example, while Google (GOOGL)'s crawler was disallowed by around 10 percent and Meta (META)'s by 4 percent.
If such constraints weren't enough, the supply of public data to train A.I. models is expected to be exhausted soon. Given the current pace at which companies are improving A.I. models, developers could run out of data between 2026 and 2032, according to a study released in June by the research group Epoch A.I.
A.I. companies are paying millions to acquire training data
As Big Tech scrambles to find enough data to support its aggressive A.I. goals, some companies are striking deals with content-rich publications to gain access to their archives. OpenAI, for example, has reportedly offered publishers between $1 million and $5 million for such partnerships. The A.I. giant has already entered into deals with publications like The Atlantic, Vox Media, The Associated Press, the Financial Times, Time and News Corp to use their archives for A.I. model training, often offering the use of products like ChatGPT in return. To unlock new data, OpenAI has even considered using Whisper, its speech-recognition tool, to transcribe video and audio from websites like YouTube -- a method that has also been discussed by Google. Other A.I. developers like Meta, meanwhile, have reportedly looked into acquiring publishing companies such as Simon & Schuster to obtain their large caches of books.
Another possible solution to the A.I. data crisis is synthetic data, a term for data generated by A.I. models instead of humans. OpenAI's Sam Altman brought up the method in an interview earlier this year, noting that data from the internet "will run out" eventually. "As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, I think it should be all right," he said.
Some prominent A.I. researchers, however, believe fears over an emerging data crisis are overblown. Fei-Fei Li, a Stanford computer scientist often dubbed the "Godmother of A.I.," argued at the Bloomberg Technology Summit in May that data-limitation concerns reflect a "very narrow view." While constraints may be tightening around internet content, Li noted that a variety of relevant data sources have yet to be tapped by A.I. "The health care industry is not running out of data, nor are industries like education, so no, I don't think we are running out of data," she said.
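As a rough illustration of what Altman's "synthetic data" pipeline implies mechanically, the sketch below uses the open-source Hugging Face transformers library to have a language model draft candidate training examples from seed prompts. The model choice (gpt2) and the prompts are illustrative assumptions, not anything the study or OpenAI describes.

    # Toy synthetic-data sketch: a language model expands seed prompts
    # into candidate training examples. Assumes the "transformers" package;
    # gpt2 stands in for a much stronger generator model.
    from transformers import pipeline, set_seed

    set_seed(42)  # make the sketch reproducible
    generator = pipeline("text-generation", model="gpt2")

    seed_prompts = [
        "Explain why robots.txt matters for web crawlers:",
        "Summarize the trade-offs of synthetic training data:",
    ]

    synthetic_corpus = []
    for prompt in seed_prompts:
        outputs = generator(prompt, do_sample=True, max_new_tokens=60,
                            num_return_sequences=3)
        # Keep only the newly generated continuation as a candidate example.
        synthetic_corpus.extend(o["generated_text"][len(prompt):].strip()
                                for o in outputs)

    # In practice the crucial step is filtering: low-quality generations
    # degrade the trained model, so candidates would be scored and culled here.
    print(len(synthetic_corpus), "candidate examples")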
As AI technology advances, the critical data needed to train these systems is vanishing at an alarming rate. This shortage poses significant challenges for the future development of artificial intelligence.
The artificial intelligence industry is facing an unexpected challenge: the rapid disappearance of training data. This essential resource, which forms the foundation of machine learning models, is becoming increasingly scarce, threatening the future development of AI technologies 1.
The scarcity of training data has two main causes. First, the explosive growth of AI applications has created unprecedented demand for high-quality, diverse datasets. Second, stricter privacy regulations and growing public awareness of data protection have restricted access to personal information 2.
This data shortage is already having significant repercussions across the AI industry. Companies are struggling to improve their existing models and develop new ones, as the lack of fresh, relevant data hinders their ability to train AI systems effectively. This situation is particularly challenging for smaller startups and research institutions that lack the resources to compete with tech giants for access to limited datasets 1.
In response to this crisis, researchers and companies are exploring innovative approaches to data acquisition and utilization. Some are turning to synthetic data generation, where artificial datasets are created to mimic real-world information. Others are investigating more efficient machine learning techniques that require less data, such as few-shot learning and transfer learning 2.
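As one concrete, hedged example of the data-efficient direction mentioned above, transfer learning reuses a model pretrained on abundant data and retrains only a small head on the scarce target data. A minimal PyTorch/torchvision sketch (the 5-class head and the fake batch are placeholders, not a method from the cited coverage):

    # Minimal transfer-learning sketch: reuse a pretrained vision model
    # and retrain only its final layer on a small labeled dataset.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False                     # freeze pretrained features
    model.fc = nn.Linear(model.fc.in_features, 5)   # new 5-class head (trainable)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative training step on a fake batch; a real pipeline
    # would loop over a (small) labeled DataLoader instead.
    x = torch.randn(8, 3, 224, 224)
    y = torch.randint(0, 5, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

Because only the final layer is trained, a model like this can be adapted with orders of magnitude fewer labeled examples than training from scratch would require.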
The data scarcity issue has also reignited debates about data ownership, privacy, and the ethical use of information in AI development. As companies become more desperate for data, there are concerns about potential breaches of privacy and the exploitation of personal information. Policymakers and industry leaders are grappling with the challenge of balancing innovation with data protection 1.
As the AI industry adapts to this new reality, experts predict a shift in focus towards more data-efficient algorithms and alternative training methods. Collaboration between academia, industry, and government bodies may become crucial in addressing the data shortage and ensuring the continued advancement of AI technologies 2.
The disappearing data phenomenon presents both challenges and opportunities for the AI field. While it may slow down progress in the short term, it could also drive innovation in data generation, collection, and utilization methods, potentially leading to more robust and ethical AI systems in the future.
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources
Ilya Sutskever, co-founder of OpenAI, warns that AI development is facing a data shortage, likening it to 'peak data'. This crisis could reshape the AI industry's future, forcing companies to seek alternative solutions.
3 Sources
Synthetic data is emerging as a game-changer in AI development, offering a solution to data scarcity and privacy concerns. This new approach is transforming how AI models are trained and validated.
2 Sources
Capital One is revolutionizing its data management practices to create a robust, AI-ready data ecosystem. This move comes as the financial industry grapples with data scarcity challenges that impact AI innovation.
2 Sources
Elon Musk asserts that AI companies have depleted available human-generated data for training, echoing concerns raised by other AI experts. He suggests synthetic data as the future of AI model training, despite potential risks.
5 Sources