Curated by THEOUTPOST
On Fri, 9 Aug, 4:03 PM UTC
2 Sources
[1]
The AI world's most valuable resource is running out, and it's scrambling to find an alternative: 'fake' data
Now, the supply of "real," human-generated data is running dry. Research firm Epoch AI predicts that usable textual data could run out by 2028. Meanwhile, companies that have mined every corner of the internet for training data -- sometimes breaking their own policies to do so -- face increasing restrictions on what remains.

To some, that's not necessarily a problem. OpenAI CEO Sam Altman has argued that AI models should eventually produce synthetic data good enough to train themselves effectively. The allure is obvious: training data has become one of the most precious resources of the AI boom, and the possibility of generating it cheaply and all but infinitely is tantalizing.

Still, researchers debate whether synthetic data is a magic bullet; some argue this path could lead to AI models poisoning themselves with poor-quality information and "collapsing" as a result. A recent paper by a group of Oxford and Cambridge researchers found that feeding a model AI-generated data eventually led it to produce gibberish. AI-generated data is not unusable for training, the authors argued, but it must be balanced with real-world data.

As the well of usable human-generated data dries up, more companies are looking to synthetic data. In 2021, research firm Gartner predicted that by 2024, 60% of the data used for AI development would be synthetically generated.

"It's a crisis," said Gary Marcus, an AI analyst and emeritus professor of psychology and neural science at New York University. "People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can."

"Yes, it will help you with some problems, but the deeper problem is that these systems don't really reason, they don't really plan," Marcus added.
"All the synthetic data you can imagine is not going to solve that foundational problem."

The need for "fake" data hinges on the notion that real-world data is fast running out. That's partly because tech firms have raced to train AI on publicly available data in an effort to outpace rivals, and partly because online data owners have grown increasingly wary of companies taking their data for free.

OpenAI researchers revealed in 2020 how they used free data from Common Crawl, a web archive containing "nearly a trillion words" from online sources, to train the model that would eventually power ChatGPT. Research published in July by MIT's Data Provenance Initiative found that websites are increasingly putting restrictions in place to stop AI firms from using data that doesn't belong to them, and news publications and other top sites are increasingly blocking AI companies from freely cribbing their content.

To get around this, companies such as OpenAI and Google are cutting checks worth tens of millions of dollars for access to data from Reddit and news outlets, which act as conveyor belts of fresh training data. Even this has its limits. "There are no longer major areas of the textual web just waiting to be grabbed," Nathan Lambert, a researcher at the Allen Institute for AI, wrote in May.

This is where synthetic data comes in. Rather than being pulled from the real world, synthetic data is generated by AI systems that have themselves been trained on real-world data. In June, for instance, Nvidia released an AI model that can create artificial datasets for training and alignment. In July, researchers at Chinese tech giant Tencent introduced Persona Hub, a synthetic data generator that does a similar job. Startups such as Gretel and SynthLabs are even popping up with the sole purpose of generating and selling troves of specific types of data to companies that need them.

Proponents of synthetic data offer fair reasons for its use.
Like the real world itself, human-generated data is often messy, leaving researchers with the complex and laborious task of cleaning and labeling it before use. Synthetic data can also potentially fill holes that human data cannot.

In late July, Meta introduced Llama 3.1, a new series of AI models that generate synthetic data and rely on it for fine-tuning during training. In particular, Meta used the data to improve performance on specific skills, such as coding in languages like Python, Java, and Rust, and solving math problems.

Synthetic training could be particularly effective for smaller AI models. Microsoft said last year that it gave OpenAI's models a diverse list of words a typical 3-to-4-year-old would know and asked them to generate short stories from that vocabulary. The resulting dataset was used to create a family of small but capable language models.

Synthetic data may also offer an effective counterweight to the biases baked into real-world data. In their 2021 paper "On the Dangers of Stochastic Parrots," former Google researchers Timnit Gebru, Margaret Mitchell, and others argued that LLMs trained on massive datasets of internet text would likely reflect that data's biases. In April, a group of Google DeepMind researchers published a paper championing synthetic data as a way to address data scarcity and privacy concerns in training, while adding that ensuring the accuracy and lack of bias of this AI-generated data "remains a critical challenge."

But even as the industry finds advantages in synthetic data, it faces serious issues it can't afford to ignore -- above all, the fear that synthetic data can wreck AI models entirely. In its research paper on Llama 3.1, Meta said that training the 405-billion-parameter version of the model "on its own generated data is not helpful," and may even "degrade performance."
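Microsoft has not published its exact pipeline, but the vocabulary-constrained, small-model recipe it described can be sketched in a few lines. Everything below is illustrative: the tiny vocabulary, the prompt wording, and the quality gate are assumptions, and a hand-written story stands in for what a large model would generate.

```python
import random

# Illustrative stand-in for the "words a 3-to-4-year-old knows" list;
# the real vocabulary would be far larger.
CHILD_VOCAB = ["cat", "dog", "ball", "run", "happy", "tree", "big", "red"]

def build_prompt(vocab, n_words=3, seed=None):
    """Sample a few required words and wrap them in a story prompt."""
    rng = random.Random(seed)
    required = rng.sample(vocab, n_words)
    prompt = ("Write a short story for a young child using the words: "
              + ", ".join(required) + ".")
    return prompt, required

def passes_vocab_gate(story, vocab):
    """Simple quality gate: accept only stories whose words stay inside
    the target vocabulary (plus a few common function words)."""
    allowed = set(vocab) | {"the", "a", "and", "to", "was", "is", "it"}
    return all(w.strip(".,!?").lower() in allowed for w in story.split())

prompt, required = build_prompt(CHILD_VOCAB, seed=0)
# In the real pipeline the prompt would go to a large model; here a
# hand-written story stands in for the generated output.
story = "The happy dog and the cat run to the big red ball."
```

Generating many such prompts and filtering the outputs yields a large corpus of simple, in-vocabulary stories -- the kind of dataset a small model can be trained on effectively.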
A study published in the journal Nature last month found that "indiscriminate use" of synthetic data in model training can cause "irreversible defects." The researchers called the phenomenon "model collapse" and warned that it must be taken seriously "if we are to sustain the benefits of training from large-scale data scraped from the web."

Jathan Sadowski, a senior research fellow at Monash University, coined a term for the idea: Habsburg AI, after the Austrian dynasty that some historians believe destroyed itself through inbreeding. Since coining the term, Sadowski told BI, he has felt validated by research backing his assertion that models trained heavily on AI outputs can become mutated.

"The open question for researchers and companies building AI systems is: how much synthetic data is too much?" Sadowski said. "They need to find any possible solution to overcome the challenges of data scarcity for AI systems -- even if those solutions are just short-term fixes that could do more harm than good by creating low-quality systems."

However, a paper published in April found that models trained on their own generated data don't necessarily "collapse" if they are trained on a mix of "real" and synthetic data. Some companies are now betting on a future of "hybrid data," in which synthetic data is generated with the help of some real data to keep the model from going off-piste. Scale AI, which helps companies label and test data, said it is exploring "the direction of hybrid data." (Scale AI CEO Alexandr Wang recently declared: "Hybrid data is the real future.")

AI may ultimately require entirely new approaches, as simply jamming more data into models may only go so far. A group of Google DeepMind researchers may have demonstrated the merits of one such approach in January, when the company announced AlphaGeometry, an AI system that can solve geometry problems at Olympiad level.
In a supplemental paper, the researchers explained that AlphaGeometry uses a "neuro-symbolic" approach, which combines the strengths of data-hungry deep-learning models and rule-based logical reasoning. IBM's research group has described neuro-symbolic AI as "a pathway to achieve artificial general intelligence." Notably, AlphaGeometry was pre-trained entirely on synthetic data.

The neuro-symbolic field is still relatively young, and it remains to be seen whether it will propel AI forward. But given the pressure companies such as OpenAI, Google, and Microsoft face to turn AI hype into profits, expect them to try every possible solution to the data crisis.

"People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can," Marcus said. "We're still basically going to be stuck here unless we take new approaches altogether."
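The collapse-versus-hybrid dynamic discussed above can be illustrated with a deliberately tiny stand-in for model training: fit a Gaussian to data, then repeatedly "retrain" on samples drawn from the previous fit. This toy construction is mine, not the Nature paper's experiment, but it shows the same qualitative effect: a lineage trained only on its own outputs loses the tails of the distribution and its spread collapses, while a lineage that mixes real data back in stays stable.

```python
import random
import statistics

def fit(samples):
    """'Train' a toy model: estimate the mean and stdev of its data."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def generate(model, n, rng):
    """Sample synthetic data from the fitted model."""
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(50)]

# "Indiscriminate" regime: each generation trains only on the
# previous generation's synthetic outputs.
pure = fit(real)
for _ in range(2000):
    pure = fit(generate(pure, 50, rng))

# Hybrid regime: each generation mixes real data back in.
hybrid = fit(real)
for _ in range(2000):
    hybrid = fit(generate(hybrid, 25, rng) + real[:25])

pure_std, hybrid_std = pure[1], hybrid[1]
# pure_std shrinks toward zero over the generations (collapse);
# hybrid_std stays near the spread of the real data.
```

The mechanism is simple: each refit slightly underestimates the variance on average, and with no real data to anchor it, the error compounds across generations.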
[2]
Synthetic Data Generation in Simulation is Keeping ML for Science Exciting
Simulations allow researchers to generate vast amounts of synthetic data, which can be critical when real-world data is scarce, expensive, or challenging to obtain. If AI could create effectively infinite streams of training data, the field would no longer be bottlenecked by scarcity -- a limitation that keeps many scientific questions out of reach, since only so much real data exists to train on. This is where AI, with the help of simulation, is taking on a crucial role.

Data generation through simulation is rapidly becoming a cornerstone of ML, especially in science. The approach not only holds promise but is reigniting enthusiasm among researchers and technologists. As Yann LeCun put it, "Data generation through simulation is one reason why the whole idea of ML for science is so exciting." In fields like aerodynamics or robotics, for instance, simulations enable the exploration of scenarios that would be impossible to test physically.

Richard Socher, the CEO of You.com, highlighted that while there are challenges, such as the combinatorial explosion in complex systems, simulations offer a pathway to manage and explore these complexities. Anthropic chief Dario Amodei has made a similar point, arguing that an effectively infinite engine for generating quality synthetic data sounds feasible and could help build better AI systems.

"If you do it right, with just a little bit of additional information, I think it may be possible to get an infinite data generation engine," said Amodei, discussing the challenges and potential of using synthetic data to train AI models. "We are working on several methods for developing synthetic data.
These are ideas where you can take real data present in the model and have the model interact with it in some way to produce additional or different data," Amodei explained.

Taking the example of AlphaGo, Amodei said that the rules of Go -- a small additional piece of information -- are enough to take a model from "no ability at all to smarter than the best human at Go". The model simply trains against itself, with nothing other than the rules of Go to adjudicate.

OpenAI, likewise, is a big proponent of synthetic data; the former team of Ilya Sutskever and Andrej Karpathy was a significant force in leveraging it to build AI models. OpenAI's progress is a testament to how far generative AI has advanced across the ecosystem, though not everyone agrees that the current training methodology can reach AGI. Microsoft is researching in the same direction: its "Textbooks Are All You Need" work is a testament to the power of synthetic data. Google DeepMind's AlphaFold, which is spearheading protein structure prediction for drug discovery, could also benefit immensely from synthetic data -- even if using such data in a field as sensitive as science can be unnerving.

The potential of simulations extends beyond mere data generation, however. Giuseppe Carleo, another expert in the field, emphasised that the most exciting prospect is not just fitting an ML model to data generated by an existing simulator. True innovation lies in training ML models to become advanced simulators themselves -- models that can simulate systems beyond the reach of traditional methods while remaining consistent with the laws of physics. This is becoming possible with synthetic data generated by agentic AI models, which are proliferating across the field.
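Amodei's AlphaGo example -- a model improving purely on self-generated games, with the rules as the only adjudicator -- can be sketched with a far simpler game. The sketch below is illustrative (not Anthropic's or DeepMind's code): random tic-tac-toe self-play produces labeled (position, outcome) pairs with no human annotation, because the rules label every finished game for free.

```python
import random

# Winning lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """The rules are the only ground truth needed to label the data."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game(rng):
    """Play one random game; return every position seen and the outcome."""
    board, positions, player = [" "] * 9, [], "X"
    while True:
        positions.append("".join(board))
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if winner(board) or not moves:
            return positions, winner(board) or "draw"
        board[rng.choice(moves)] = player
        player = "O" if player == "X" else "X"

rng = random.Random(0)
# Each game yields several labeled training examples -- synthetic data
# adjudicated entirely by the rules of the game.
dataset = []
for _ in range(100):
    positions, outcome = self_play_game(rng)
    dataset.extend((pos, outcome) for pos in positions)
```

A real self-play system would replace the random move choice with the model's own policy and retrain on the labeled positions, but the data-generation loop has the same shape.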
Models that can test, train, and fine-tune themselves using data they created are an exciting prospect for the future of AI research. The discussion around simulations also touches on broader applications: Sina Shahandeh, a researcher in biotechnology, suggested that the ultimate simulation could model entire economies using an agent-based approach, a concept that is slowly becoming feasible.

Despite the excitement, the field has its sceptics. Stephan Hoyer, a researcher with a cautious outlook on AGI, pointed out that simulating complex biological systems well enough to make training data unnecessary would require groundbreaking advancements -- a task he believes is far harder than achieving AGI. Similarly, Jim Fan, senior AI scientist at NVIDIA, said that while synthetic data is expected to play a noteworthy role, blind scaling alone will not suffice to reach AGI.

Using synthetic data for science can be tricky. But generating it in simulation shows promise, because ideas can be tried and tested without being deployed in real-world applications -- and the possibility of effectively infinite data is what keeps ML exciting for researchers.
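The simulator-as-data-generator idea can be sketched concretely. In the illustrative example below (my construction, not from either article), a closed-form projectile model stands in for an expensive physics simulator: cheap to query, grounded in known laws, and able to cover launch conditions that would be costly to test physically. A sweep over inputs yields a labeled synthetic dataset ready for supervised training; all names and parameters are assumptions.

```python
import math
import random

def simulate_trajectory(v0, angle_deg, g=9.81, noise=0.05, rng=None):
    """Return ((launch speed, launch angle), measured range).

    The closed-form range formula plays the role of a physics
    simulator; multiplicative Gaussian noise mimics sensor error.
    """
    rng = rng or random.Random()
    theta = math.radians(angle_deg)
    true_range = v0 ** 2 * math.sin(2 * theta) / g
    measured = true_range * (1 + rng.gauss(0, noise))
    return (v0, angle_deg), measured

rng = random.Random(0)
# Sweep launch conditions to build a synthetic supervised dataset:
# 5 speeds x 8 angles = 40 labeled examples, generated instantly.
dataset = [
    simulate_trajectory(v0, angle, rng=rng)
    for v0 in range(5, 30, 5)
    for angle in range(10, 90, 10)
]
```

A regression model trained on such data learns the simulator's physics; the same pattern scales up to aerodynamics or robotics simulators where each real-world measurement would cost time and money.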
Synthetic data is emerging as a game-changer in AI and machine learning, offering solutions to data scarcity and privacy concerns. However, its rapid growth is sparking debates about authenticity and potential risks.
In the rapidly evolving world of artificial intelligence and machine learning, a new player is making waves: synthetic data. This artificially generated information is becoming increasingly crucial in training AI models, addressing data scarcity issues, and navigating privacy concerns. As the industry grows, it's sparking both excitement and debate among experts and ethicists alike [1].

Synthetic data is proving to be a powerful solution to one of the most persistent challenges in AI development: the need for vast amounts of high-quality, diverse data. Traditional methods of data collection can be time-consuming, expensive, and often fraught with privacy issues. Synthetic data offers a way to generate large datasets quickly and efficiently, without compromising individual privacy [1].

The impact of synthetic data extends beyond commercial applications. In scientific research, particularly in fields like physics and chemistry, synthetic data generation in simulations is keeping machine learning exciting and productive. These simulations allow researchers to explore complex phenomena and test hypotheses in ways that would be impossible or impractical in the real world [2].

As the potential of synthetic data becomes more apparent, a new industry is emerging around its creation and application. Companies specializing in synthetic data generation are attracting significant investment, with the market expected to grow substantially in the coming years. This growth is driven by the increasing demand for AI solutions across various sectors, from healthcare to finance [1].

However, the rise of synthetic data is not without controversy. Critics argue that the use of "fake" data could lead to biased or unreliable AI models. There are concerns about the authenticity and representativeness of synthetic datasets, particularly in sensitive applications like healthcare diagnostics or financial risk assessment. The industry is grappling with questions of transparency, accountability, and the potential for misuse [1].

As synthetic data continues to gain traction, researchers and industry leaders are working to address these concerns. Efforts are being made to develop standards and best practices for synthetic data generation and use. The goal is to harness the benefits of synthetic data while mitigating potential risks and ensuring the reliability of AI systems trained on this artificial information [1][2].