Curated by THEOUTPOST
On Tue, 17 Dec, 12:04 AM UTC
3 Sources
[1]
OpenAI Co-Founder Ilya Sutskever Rings Alarm Bells: AI's 'Fossil Fuel' Is Running Out As World Reaches 'Peak Data' - Microsoft (NASDAQ:MSFT), Meta Platforms (NASDAQ:META)
OpenAI co-founder Ilya Sutskever is sounding the alarm on a looming data crisis that could reshape the artificial intelligence industry's future. What Happened: Speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver on Friday, Sutskever warned that the critical resource powering AI development is running dry, reported the Observer. "Data is the fossil fuel of AI," Sutskever said at the conference. "We've achieved peak data and there will be no more." The warning comes amid growing evidence of data access restrictions. A study by the Data Provenance Initiative found that between 2023 and 2024, website owners blocked AI companies from accessing 25% of high-quality data sources and 5% of all data across major AI datasets. This scarcity is already forcing industry leaders to adapt. OpenAI CEO Sam Altman has proposed using synthetic data - information generated by AI models themselves - as an alternative solution. The company is also exploring enhanced reasoning capabilities through its new o1 model. Why It Matters: The data shortage concerns echo recent observations from venture capital firm Andreessen Horowitz. Marc Andreessen noted that AI capabilities have plateaued, with multiple companies hitting similar technological ceilings. Sutskever, who left OpenAI earlier this year to launch Safe Superintelligence with $1 billion in backing from investors including Andreessen Horowitz and Sequoia Capital, believes AI will evolve beyond its data dependency. "Future AI systems will understand things from limited data, they will not get confused," he said, though he declined to specify how or when this would occur.
The increasing difficulty in accessing diverse and high-quality datasets for AI training has prompted companies like OpenAI, Meta Platforms Inc. (META), NVIDIA Corp. (NVDA), and Microsoft Corp. (MSFT) to adopt data scraping practices, though not without controversy. For example, Microsoft's LinkedIn was recently scrutinized for using user data to train its AI models before updating its terms of service. Similarly, Meta has been using publicly available social media posts from Europe to train its Llama large language models, though privacy concerns have prompted legal challenges. Nvidia, too, has been scraping videos from YouTube and Netflix, including those from popular tech YouTuber Marques Brownlee, to train its AI systems. While these companies argue their practices comply with copyright laws, the ethical implications of scraping data without explicit consent have raised alarm across the industry. Disclaimer: This content was partially produced with the help of AI tools and was reviewed and published by Benzinga editors.
[2]
Ilya Sutskever Warns A.I. Is Running Out of Data -- Here's What Will Happen Next
"We've achieved peak data and there will be no more," said the OpenAI co-founder. Ilya Sutskever, an OpenAI co-founder and the company's former chief scientist, has played a key role in ushering in the technology's most pivotal breakthroughs. Underpinning such developments were troves of text, images and videos scraped from the Internet to train A.I. models and enhance their capabilities -- data that will soon run out, according to the researcher. "Data is the fossil fuel of A.I.," said Sutskever while speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver on Dec. 13. "We've achieved peak data and there will be no more." This means that pre-training -- the process of feeding models with mass amounts of information -- "will unquestionably end," added the researcher, who noted that A.I. developers are already looking into alternative solutions like synthetic data or models that improve responses by taking longer to think about potential answers. Sutskever, 38, first made a name for himself in 2012 when he helped develop the convolutional neural network architecture AlexNet. He also helped establish OpenAI in 2015 and oversaw the ChatGPT-maker's research efforts before departing earlier this year to launch his own startup, Safe Superintelligence. The startup recently raised $1 billion from investors like Andreessen Horowitz and Sequoia Capital.
What's next for A.I.? While factors like compute power and algorithms -- key aspects for A.I. model training -- have continued to improve, data simply cannot keep pace, Sutskever said. "We have but one internet." Data-hungry A.I. developers, which have already sucked up mass amounts of online information from the internet, are starting to hit roadblocks from website owners. Between 2023 and 2024, 5 percent of all data and 25 percent of data from the highest quality sources were restricted across major A.I. datasets, according to a study from the Data Provenance Initiative. As their data wells run dry, A.I. leaders are desperately searching for pre-training replacements. Synthetic data, or data generated by A.I. models themselves, have been suggested as a solution by the likes of OpenAI CEO Sam Altman. Altman has also pointed to the reasoning capabilities of the company's new o1 model, which thinks through various responses before answering queries, as a roadmap to improving A.I. capabilities in the future. As they gain enhanced reasoning capabilities, A.I. systems will become more "agentic," said Sutskever, echoing the belief of other tech leaders that autonomous A.I. agents are the field's next big focus. Through his startup, Sutskever himself is currently focused on achieving a safe form of "superintelligence," a type of A.I. that thinks, reasons and can surpass human intelligence. A.I. that learns to reason and think on its own will inevitably give way to less predictable behavior from models. Such behavior can already be seen in chess A.I. models, said Sutskever, which "are unpredictable to the best human chess players." Future A.I. systems "will understand things from limited data, they will not get confused," said Sutskever. "I'm not saying how, by the way, and I'm not saying when -- I'm saying that it will."
[3]
AI Training Debate Raises Stakes for Digital Economy | PYMNTS.com
Leading AI researchers caution that training systems on internet data may be hitting their limits, raising concerns about the future of data-driven business models across the digital economy. Warnings by former OpenAI chief scientist Ilya Sutskever about data training constraints, as reported by Reuters, have rattled technology markets. Speaking at the NeurIPS conference, Sutskever emphasized the need for innovative approaches, such as AI-generated data and enhanced reasoning capabilities, to advance artificial intelligence. He predicted that future AI systems will possess human-like reasoning abilities, making their behavior less predictable and necessitating a shift in AI development strategies. But other experts argue current methods still have room to run, leaving companies to navigate competing visions of how to value and deploy AI systems that power everything from fraud detection to inventory management. "Internet data is running out, and AI companies are feeling the pressure," Arunkumar Thirunagalingam, senior manager of data and technical operations at the McKesson Corporation, told PYMNTS. "For years, they relied on scraping huge amounts of online content to train their systems. That worked for a while, but now the easy data is drying up. This shift is putting the spotlight on companies with unique data sources, like healthcare records or logistics information. It is no longer about how much data you can grab; it is about having the right kind of data." AI systems rely on vast amounts of data from the internet to train and improve. However, the pool of high-quality, diverse data is finite, and researchers may be nearing the limits of what's available. As models grow larger and demand more input, the risk of recycling similar information increases, leading to diminishing returns. Additionally, much of the internet's content is noisy or repetitive, reducing its usefulness for cutting-edge training. 
This scarcity challenges researchers to seek alternatives, like creating synthetic data, leveraging specialized datasets, or developing models that rely less on raw data and more on advanced reasoning capabilities. With less internet data to scrape, companies are getting creative, Thirunagalingam said. They turn to real-world sources like IoT devices and sensors to collect fresh information. Crowd-sourcing platforms are paying people to share their unique insights, creating even more options. "This shift is already making waves in farming, where AI uses real-time data to improve crop yields, and in urban planning, where city sensors help design smarter infrastructure," he added. "Companies that once sat on overlooked datasets are now finding new ways to monetize them, from partnerships to licensing deals. What seemed unimportant before is now a goldmine, sparking fresh ideas and business models." Komninos Chatzipapas, founder of HeraHaven AI, acknowledged that the industry is running into a data wall. "The biggest AI companies have basically already scraped everything on the internet," he told PYMNTS. "Also, a lot of the new internet content being published is itself AI-generated (which cannot be used for training as it will reinforce the existing biases these AI models have), and more and more publishers are blocking scraping bots like GPTBot from crawling their sites via their robots.txt." For pre-training AI models, Chatzipapas said, the data wall primarily affects unstructured training data, such as news articles and forum discussions. Pre-training is the initial phase of AI model development where the model learns general language patterns and knowledge from vast amounts of text data before being fine-tuned for specific tasks. "There is still work to be done on creating great structured data for training AI models," he added. 
This can be, for example, very complex math/science problems that are solved in a step-by-step manner so the AI model can learn to reason, he said. One solution to the data drought is emerging through deals with academic publishers, who are offering their scholarly articles in exchange for millions of dollars. Microsoft's recent $10 million deal with Taylor & Francis opened the floodgates for AI companies to tap into academic publishers' vast research archives.
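The robots.txt blocking that Chatzipapas describes can be illustrated with a short sketch. It uses Python's standard `urllib.robotparser` module to test whether a given crawler is allowed to fetch a URL. The robots.txt contents, the `is_blocked` helper, the `ExampleBrowser` agent name, and the example.com URL are all hypothetical and chosen only for illustration; they do not come from any publisher named in the sources.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it disallows GPTBot everywhere
# and leaves every other crawler unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"  # placeholder URL
    for agent in ("GPTBot", "ExampleBrowser"):
        status = "blocked" if is_blocked(SAMPLE_ROBOTS_TXT, agent, url) else "allowed"
        print(f"{agent}: {status}")
```

Note that robots.txt is advisory: it tells a crawler what the site owner permits, but honoring it is up to the crawler.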
Ilya Sutskever, co-founder of OpenAI, warns that AI development has reached 'peak data', comparing training data to a finite fossil fuel. The shortage could reshape the AI industry's future, forcing companies to seek alternative solutions.
Ilya Sutskever, co-founder of OpenAI and former chief scientist, has sounded the alarm on a looming data crisis that could significantly impact the future of artificial intelligence (AI) development. Speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver, Sutskever warned that the critical resource powering AI development is running dry [1].
"Data is the fossil fuel of AI," Sutskever stated. "We've achieved peak data and there will be no more." This stark assessment highlights the growing concern that the AI industry may be approaching the limits of available high-quality data for training advanced models [2].
The warning comes amid mounting evidence of data access restrictions. A study by the Data Provenance Initiative found that between 2023 and 2024, website owners blocked AI companies from accessing 25% of high-quality data sources and 5% of all data across major AI datasets [1].
This scarcity is already forcing industry leaders to adapt. OpenAI CEO Sam Altman has proposed using synthetic data - information generated by AI models themselves - as an alternative solution. The company is also exploring enhanced reasoning capabilities through its new o1 model [1].
The data shortage is prompting AI developers to seek innovative approaches to advance artificial intelligence. Sutskever predicts that future AI systems will possess human-like reasoning abilities, making their behavior less predictable and necessitating a shift in AI development strategies [3].
"Future AI systems will understand things from limited data, they will not get confused," Sutskever said, though he declined to specify how or when this would occur [2].
Because the pool of high-quality, diverse data is finite, companies are exploring various alternatives:
Synthetic Data: AI-generated information to supplement training datasets [1].
Enhanced Reasoning Capabilities: Developing models that rely less on raw data and more on advanced reasoning, like OpenAI's o1 model [1].
Real-world Data Sources: Leveraging IoT devices and sensors to collect fresh information [3].
Crowd-sourcing Platforms: Paying people to share unique insights [3].
Academic Partnerships: Deals with academic publishers to access scholarly articles, such as Microsoft's recent $10 million agreement with Taylor & Francis [3].
The data crisis is raising concerns about the future of data-driven business models across the digital economy. Companies with unique data sources, such as healthcare records or logistics information, may find new opportunities to monetize their datasets through partnerships or licensing deals [3].
As the AI industry grapples with these challenges, the focus is shifting from quantity to quality of data. This transition is likely to spark fresh ideas and business models, potentially reshaping the landscape of AI development and application across various sectors.
Reference
As AI technology advances, the critical data needed to train these systems is vanishing at an alarming rate. This shortage poses significant challenges for the future development of artificial intelligence.
2 Sources
Synthetic data is emerging as a game-changer in AI development, offering a solution to data scarcity and privacy concerns. This new approach is transforming how AI models are trained and validated.
2 Sources
Elon Musk asserts that AI companies have depleted available human-generated data for training, echoing concerns raised by other AI experts. He suggests synthetic data as the future of AI model training, despite potential risks.
5 Sources
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources
AI experts warn of diminishing returns in AI development due to the exhaustion of available digital text data, potentially leading to a slowdown in chatbot improvements and necessitating new approaches in AI research.
2 Sources