Curated by THEOUTPOST
On Fri, 2 Aug, 4:05 PM UTC
2 Sources
[1]
Can Synthetic Data Help Solve Generative A.I.'s Training Data Crisis?
Synthetic data provides a way around issues like intellectual property litigation but comes with its own risks.

The supply of quality, real-world data used to train generative A.I. models appears to be dwindling as digital publishers increasingly restrict access to their public data, according to a recent study. That means the advancement of large language models like OpenAI's GPT-4 and Google's Gemini could hit a wall once the A.I.s scrape all the remaining data on the internet.

To address the growing A.I. training data crisis, some experts are considering synthetic data as a potential alternative. Real-world data, created by real humans, includes news articles, YouTube videos and other forms of text and image content. Synthetic data, on the other hand, is artificially generated by machine learning models based on samples of real data. While synthetic data isn't particularly new, using it to train A.I. models like GPT is a technique major companies, including OpenAI, are exploring -- a practice experts say could backfire if done incorrectly. "It's still kind of the Wild West when it comes to generative A.I. models," Kjell Carlsson, head of A.I. strategy at Domino Data Lab, a machine learning platform for businesses, told Observer.

How synthetic data can be used to train generative A.I.

Synthetic data has long been used to address the lack of sufficient training data for A.I. applications such as autonomous driving systems. For instance, companies like Waymo and Tesla use synthetic data to train their systems to respond to a wide range of road conditions. Now, some experts believe there are creative ways synthetic data can be used to train generative A.I. models.

Synthetic data generated by large models like OpenAI's GPT-4 can potentially be used to fine-tune smaller, more specialized models, according to Carlsson. For instance, automaker advertisers may use ChatGPT to generate customer profiles of middle-aged women from Minneapolis who own cars. That data can then be used to train a smaller model representing that customer segment to create targeted ads (see the sketch below). Additionally, LLMs that are good at translation can produce an abundance of training data in other languages to "boost the performance of a different LLM" with those languages, Carlsson said.

"Synthetic data plays a crucial role in enhancing our large language models," Jigyasa Grover, a former machine learning engineer at X who now leads A.I. at Bordo AI, a maker of conversational data analytics software, told Observer. "By generating synthetic datasets, we can train LLMs on a diverse range of scenarios and edge cases that may not be adequately represented in real-world data. This improves the generalization capabilities of our models, making them more adaptable and effective in various applications."
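To make the workflow Carlsson describes concrete, here is a minimal sketch: a large model is prompted for fictional customer profiles, and the outputs are saved as examples that could later be used to fine-tune a smaller, specialized model. The prompt, file name and "gpt-4o" model name are illustrative assumptions rather than details from the article, and the code assumes the OpenAI Python client with an API key set in the environment.

    import json
    from openai import OpenAI  # assumes the OpenAI Python client is installed

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "Generate one realistic but entirely fictional customer profile for a "
        "car-owning, middle-aged shopper in the U.S. Midwest, as compact JSON "
        "with keys: age, city, vehicle, interests."
    )

    synthetic_examples = []
    for _ in range(50):  # small batch; a real pipeline would generate far more
        response = client.chat.completions.create(
            model="gpt-4o",  # stand-in for any capable generator model
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # some randomness so the profiles vary
        )
        synthetic_examples.append(response.choices[0].message.content)

    # Store the records as JSONL, a format fine-tuning tools commonly accept.
    with open("synthetic_profiles.jsonl", "w") as f:
        for record in synthetic_examples:
            f.write(json.dumps({"text": record}) + "\n")

A real pipeline would typically also filter, deduplicate and review the generated records before any fine-tuning run.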
Synthetic data can be an alternative to sensitive data

Artificially generated data can also be used to fill in information gaps when organizations don't want to give up their sensitive data, especially in high-impact sectors like health care, finance and law enforcement, said Neil Sahota, an A.I. advisor to the United Nations and CEO of the A.I. research firm ACSILabs.

For example, hospitals can synthetically generate images of lung cancer X-rays at different angles as a way to train A.I. models that could help doctors identify tumors more quickly and accurately, Sahota said. Similarly, governments can train their A.I. on examples of money laundering that financial institutions don't make public to help identify the characteristics of actors behind corporate crime. "Synthetic data is a great way to bridge some of that gap," Sahota told Observer.

Synthetic data also provides a way around intellectual property issues, a growing headache for A.I. companies. Training LLMs on synthetic data protects companies like OpenAI from being sued by artists, writers and publishers for using their creative works to train chatbots. "Synthetic training data could clear a lot of these issues," Star Kashman, an attorney specializing in litigation in the tech sector, told Observer. "That gets around the hurdle of unintentionally infringing upon other people's work."

Synthetic data can create more problems -- and isn't always necessary

Despite the potential technical and legal advantages of using synthetic data, training A.I. on non-human data comes with risks. Aside from the skepticism around so-called "fake data," synthetic data could perpetuate biases and inaccuracies in a model's pre-existing dataset if the A.I. isn't trained correctly. A Nature study published in July found that A.I. models generated lower-quality outputs after they were trained on A.I.-generated data -- a phenomenon known in the machine learning community as "model collapse" (a toy simulation appears at the end of this article). That could be, in part, because synthetic data generation techniques are still new and there just aren't enough engineers with the skills needed to perform and test them, according to Carlsson. "You can totally screw things up and make things worse," he said.

In turn, companies that use biased synthetic data to train A.I. may be held liable if their models generate outputs that a plaintiff perceives as discriminatory, unethical or inaccurate, according to Kashman, the attorney.

Besides, synthetic data may not even be necessary: there may still be plenty of real-world data that has yet to be extracted, according to Mayur Pillay, vice president of corporate development at Hyperscience, an A.I. software company that converts corporate documents like claims and invoices into machine-actionable data. While synthesizing data could be useful in some cases, there's no substitute for the real thing, especially for complex data types like handwriting on forms, which are difficult to replicate because they require context, according to Pillay. "There's actually so much data still that can be used to train these specialized models," he said. "It's just embedded at the core of the enterprise."

Even though synthetic data poses risks, some experts agree that, if handled with caution and mixed with real data, it could help address the shortage of A.I. training data. Still, it seems unlikely that synthetic data will be the main trove of information A.I. companies turn to as they seek new sources of training data -- at least for now. "Currently, you have gigabytes and petabytes of data being used to train a large language model," Grover said. "Clearly, we are not at the point yet where we can generate that amount of unbiased and balanced data set."
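As a toy illustration of the "model collapse" effect described above, the sketch below repeatedly fits a very simple model (a Gaussian) to data generated by the previous generation of the same model. It is a stylized simulation with made-up sample sizes, not the setup of the Nature study: with finite samples, estimation errors compound, and the learned distribution typically loses its spread and its tails over generations.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 200        # data available to each "generation" of the model
    n_generations = 100

    # Generation 0: the real data, drawn from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

    for gen in range(1, n_generations + 1):
        mu, sigma = data.mean(), data.std()      # "train" on the previous generation's data
        data = rng.normal(mu, sigma, n_samples)  # "generate" the next generation's dataset
        if gen % 20 == 0:
            print(f"generation {gen}: fitted std = {data.std():.3f}")

Exact numbers vary from run to run, but the drift is the point: each generation learns from an increasingly distorted copy of the original data rather than from reality.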
[2]
Is AI About to Run Out of Data? The History of Oil Says No
Levin is a Yorktown Institute Fellow and research lead at Kurzweil Technologies.

Is the AI bubble about to burst? Every day that the stock prices of semiconductor champion Nvidia and the so-called "Fab Five" tech giants (Microsoft, Apple, Alphabet, Amazon, and Meta) fail to regain their mid-year peaks, more people ask that question. It would not be the first time in financial history that the hype around a new technology led investors to drive up the value of the companies selling it to unsustainable heights -- and then get cold feet. Political uncertainty around the U.S. election is itself raising the probability of a sell-off, as Donald Trump expresses his lingering resentments against the Big Tech companies and his ambivalence towards Taiwan, where the semiconductors essential for artificial intelligence mostly get made.

The deeper question is whether AI can deliver the staggering long-term value that the internet has. If you invested in Amazon in late 1999, you would have been down over 90% by early 2001. But you would be up over 4,000% today.

A chorus of skeptics now loudly claims that AI progress is about to hit a brick wall. Models such as GPT-4 and Gemini have already hoovered up most of the internet's data for training, the story goes, and will lack the data needed to get much smarter. However, history gives us a strong reason to doubt the doubters. Indeed, we think they are likely to end up in the same unhappy place as those who in 2001 cast aspersions on the future of Jeff Bezos's scrappy online bookstore.

The generative AI revolution has breathed fresh life into the TED-ready aphorism "data is the new oil." But when LinkedIn influencers trot out that 2006 quote by British entrepreneur Clive Humby, most of them are missing the point. Data is like oil, but not just in the facile sense that each is the essential resource that defines a technological era. As futurist Ray Kurzweil observes, the key is that both data and oil vary greatly in the difficulty -- and therefore cost -- of extracting and refining them.

Some petroleum is light crude oil just below the ground, which gushes forth if you dig a deep enough hole in the dirt. Other petroleum is trapped far beneath the earth or locked in sedimentary shale rocks, and requires deep drilling and elaborate fracking or high-heat pyrolysis to be usable. When oil prices were low prior to the 1973 embargo, only the cheaper sources were economically viable to exploit. But during periods of soaring prices over the decades since, producers have been incentivized to use increasingly expensive means of unlocking further reserves.

The same dynamic applies to data -- which is, after all, the plural of the Latin datum. Some data exist in neat and tidy datasets -- labeled, annotated, fact-checked, and free for download in a common file format. But most data are buried more deeply. Data may be on badly scanned handwritten pages; may consist of terabytes of raw video or audio without any labels on relevant features; or may be riddled with inaccuracies and measurement errors or skewed by human biases. And most data are not on the public internet at all. An estimated 96% to 99.8% of all online data are inaccessible to search engines -- for example, paywalled media, password-protected corporate databases, legal documents, and medical records, plus an exponentially growing volume of private cloud storage.
In addition, the vast majority of printed material has still never been digitized -- around 90% for high-value collections such as the Smithsonian and U.K. National Archives, and likely a much higher proportion across all archives worldwide. Yet arguably the largest untapped category is information that's currently not captured in the first place, from the hand motions of surgeons in the operating room to the subtle expressions of actors on a Broadway stage.

For the first decade after large amounts of data became the key to training state-of-the-art AI, commercial applications were very limited. It therefore made sense for tech companies to harvest only the cheapest data sources. But the launch of OpenAI's ChatGPT in 2022 changed everything. Now, the world's tech titans are locked in a frantic race to turn theoretical AI advances into consumer products worth billions. Many millions of users now pay around $20 per month for access to the premium AI models produced by Google, OpenAI, and Anthropic. But this is peanuts compared to the economic value that will be unlocked by future models capable of reliably performing professional tasks such as legal drafting, computer programming, medical diagnosis, financial analysis, and scientific research.

The skeptics are right that the industry is about to run out of cheap data. As smarter models enable wider adoption of AI for lucrative use cases, however, powerful incentives will drive the drilling for ever more expensive data sources -- the proven reserves of which are orders of magnitude larger than what has been used so far. This is already catalyzing a new training-data sector, as companies including Scale AI, Sama, and Labelbox specialize in the digital refining needed to make the less accessible data usable.

This is also an opportunity for data owners. Many companies and nonprofits have mountains of proprietary data that are gathering dust today but which could be used to propel the next generation of AI breakthroughs. OpenAI has already spent hundreds of millions of dollars licensing training data, inking blockbuster deals with Shutterstock and the Associated Press for access to their archives. Just as there was speculation in mineral rights during previous oil booms, we may soon see a rise in data brokers finding and licensing data in the hope of cashing in when AI companies catch up.

Much like the geopolitical scramble for oil, competition for top-quality data is also likely to affect superpower politics. Countries' domestic privacy laws affect the availability of fresh training data for their tech ecosystems. The European Union's 2016 General Data Protection Regulation leaves Europe's nascent AI sector with an uphill climb to international competitiveness, while China's expansive surveillance state allows Chinese firms to access larger and richer datasets than can be mined in America. Given the military and economic imperatives to stay ahead of Chinese AI labs, Western firms may thus be forced to look overseas for sources of data unavailable at home.

Yet just as alternative energy is fast eroding the dominance of fossil fuels, new AI development techniques may reduce the industry's reliance on massive amounts of data. Premier labs are now working to perfect techniques known as "synthetic data" generation and "self-play," which allow AI to create its own training data.
And while AI models currently learn several orders of magnitude less efficiently than humans, as models develop more advanced reasoning, they will likely be able to hone their capabilities with far less data. There are legitimate questions about how long AI's recent blistering progress can be sustained. Despite enormous long-term potential, the short-term market bubble will likely burst before AI is smart enough to live up to the white-hot hype. But just as generations of "peak oil" predictions have been dashed by new extraction methods, we should not bet on an AI bust due to data running out.
Synthetic data is emerging as a game-changer in AI development, offering a solution to data scarcity and privacy concerns. This new approach is transforming how AI models are trained and validated.
In the rapidly evolving world of artificial intelligence, a new player has entered the field: synthetic data. This approach to AI training is gaining traction as a solution to some of the most pressing challenges in the industry. Synthetic data, artificially generated information that mimics real-world data, is poised to transform the landscape of AI development [1].

One of the primary drivers behind the adoption of synthetic data is the growing scarcity of high-quality, diverse datasets. As AI applications become more sophisticated, the demand for extensive and varied data has skyrocketed. Synthetic data offers a viable alternative, allowing developers to generate vast amounts of data that can be tailored to specific needs [2].

Moreover, synthetic data provides a solution to the increasing privacy concerns surrounding data collection and usage. By creating artificial datasets that maintain the statistical properties of real data without containing actual personal information, companies can sidestep many of the legal and ethical issues associated with data privacy [1].
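As a minimal sketch of that idea, the snippet below fits simple summary statistics (means and correlations) of a stand-in "sensitive" table and then samples synthetic rows that keep the aggregate structure without reproducing any real record. The column meanings and numbers are invented for illustration; real systems use far more sophisticated generators and add formal privacy safeguards.

    import numpy as np

    rng = np.random.default_rng(42)

    # Stand-in for a sensitive table: invented columns (age, income, visits).
    real = rng.normal(loc=[45, 60_000, 3], scale=[12, 15_000, 2], size=(1_000, 3))

    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # captures correlations between columns

    # Sample synthetic rows from the fitted distribution. No row corresponds to
    # a real record, but aggregate statistics roughly match the original table.
    synthetic = rng.multivariate_normal(mean, cov, size=1_000)

    print("real means:     ", np.round(real.mean(axis=0), 1))
    print("synthetic means:", np.round(synthetic.mean(axis=0), 1))

Even this crude approach shows the trade-off described here: the synthetic table is useful in aggregate, but only as faithful as the model used to generate it.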
Experts in the field are noting significant improvements in AI model performance when trained on synthetic data. These artificially generated datasets can be designed to include edge cases and rare scenarios that might be underrepresented in real-world data. This comprehensive coverage allows AI models to become more robust and adaptable to a wider range of situations [2].
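One simple, deliberately hypothetical way to manufacture extra examples of a rare scenario is to interpolate between the few real examples that do exist (a SMOTE-style augmentation); the feature shapes and counts below are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(7)

    # Only 20 real examples of the rare case exist, with 4 invented feature columns.
    minority = rng.normal(size=(20, 4))

    def synthesize(rows, n_new):
        """Create new points by interpolating between random pairs of real rows."""
        i = rng.integers(0, len(rows), size=n_new)
        j = rng.integers(0, len(rows), size=n_new)
        alpha = rng.random((n_new, 1))
        return rows[i] + alpha * (rows[j] - rows[i])

    extra = synthesize(minority, n_new=200)     # 10x more synthetic edge-case rows
    augmented = np.vstack([minority, extra])
    print(augmented.shape)                      # prints (220, 4)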
The synthetic data market is experiencing rapid growth, with projections suggesting it could reach billions of dollars in value within the next few years. This growth is driven by the increasing recognition of synthetic data's potential to accelerate AI development cycles and reduce costs associated with data collection and annotation [1].

Despite its promise, synthetic data is not without its challenges. Ensuring that synthetic datasets accurately represent the complexities and nuances of real-world data remains a significant hurdle. There are also concerns about potential biases that could be inadvertently introduced during the data generation process [2].

As the field of synthetic data continues to evolve, it is likely to play an increasingly important role in the development of AI technologies. Researchers and companies are investing heavily in improving synthetic data generation techniques, aiming to create ever more realistic and useful datasets [1].
The rise of synthetic data marks a significant shift in the AI landscape, potentially democratizing access to high-quality training data and accelerating the pace of innovation in the field. As this technology matures, it could reshape our understanding of data as a resource and redefine the boundaries of what's possible in artificial intelligence.
Synthetic data is emerging as a game-changer in AI and machine learning, offering solutions to data scarcity and privacy concerns. However, its rapid growth is sparking debates about authenticity and potential risks.
2 Sources
Tech companies are increasingly turning to synthetic data for AI model training due to a potential shortage of human-generated data. While this approach offers solutions, it also presents new challenges that need to be addressed to maintain AI accuracy and reliability.
2 Sources
A comprehensive look at the latest developments in AI, including OpenAI's internal struggles, regulatory efforts, new model releases, ethical concerns, and the technology's impact on Wall Street.
6 Sources
Ilya Sutskever, co-founder of OpenAI, warns that AI development is facing a data shortage, likening it to 'peak data'. This crisis could reshape the AI industry's future, forcing companies to seek alternative solutions.
3 Sources
Leading AI companies are experiencing diminishing returns on scaling their AI systems, prompting a shift in approach and raising questions about the future of AI development.
7 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved