Curated by THEOUTPOST
On Tue, 11 Mar, 12:05 AM UTC
2 Sources
[1]
Will synthetic data derail generative AI's momentum or be the breakthrough we need?
With the rise of generative AI, synthetic images and text have become common knowledge -- but are you familiar with synthetic data? As the name implies, the term refers to data that is artificially generated and used in place of real data. It is used to build solutions in healthcare, finance, the automotive industry, and, most importantly, artificial intelligence. Synthetic data is such an integral part of the digital revolution that South by Southwest (SXSW) held an AI session titled "Impact of Simulated Data on AI and the Future," meant to analyze the technology's ability to bolster and support generative AI while also evaluating its potential risks.

The panel featured Mike Hollinger, director of product management, enterprise gen AI software at NVIDIA; Oji Udezue, CPO at Typeform; and Tahir Ekin, Fields Chair in business analytics at Texas State University, all of whom retained an overall positive outlook on the technology. "For us, it [synthetic data] makes our ability to build the right thing cheaper and better -- which is a holy grail," said Udezue.

Synthetic data enables users to simulate real-world insights in situations where collecting actual data would be too costly or time-consuming, or could pose privacy concerns -- such as when sensitive financial information is involved. Its recent surge in popularity is largely due to its growing role in training and refining machine learning and AI models, which has become increasingly important amid the rapid development of those models over the past year.

"With ChatGPT, with Gemini, with Claude, with DeepSeek, with any of these models, inside of that model's training data is most likely a synthetic generation step," said Hollinger.
"This synthetic data is taking parts of that training material, and it's amplifying it to give different variations so that I could then train the model to give whatever the output is."

Synthetic data is especially valuable for AI models because they require large, diverse, high-quality datasets for effective training, and such datasets can be difficult or impractical to obtain. This is particularly true when targeting niche, proprietary, or original datasets that aren't readily available through public data scraping. In a report released last week, research firm Gartner identified synthetic data as one of the top data and analytics trends for 2025. Specifically, the report encourages using synthetic data to supplement areas where insight is missing or incomplete, or to replace sensitive data in order to protect privacy.

To create synthetic data, complex algorithms take an original dataset and replicate the patterns, structures, and other characteristics found within it. However, as with any other AI output, there is potential for deviations that can have a significant impact. To illustrate the idea, Hollinger used the example of how many hours were in the day of the conference -- a tricky question because, technically, that Sunday had only 23 hours due to the daylight saving time change. If a sample of data were taken from random days throughout the year, one of the days selected might come from a city that observes daylight saving time, where there was an hour less. A synthetic data pipeline built from that sample could quietly undermine a model's accuracy.

Consequently, when building synthetic datasets, it is imperative that the data be grounded in the real world to avoid these kinds of incongruities and to ensure the dataset is as representative as possible of the scenario it is meant to capture.
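The pattern-replication step described above can be sketched in miniature: fit simple per-column statistics to a small "real" table, then sample new rows from them. This is a toy stand-in for the far more sophisticated generators used in practice, and every name and number below is invented for illustration.

```python
import random
import statistics

# A tiny "real" dataset: (age, income) pairs. Values are made up for illustration.
real = [(34, 52000), (29, 48000), (45, 61000), (38, 57000), (41, 59000)]

def fit(rows):
    # Capture each column's "pattern" as a mean and standard deviation.
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample(params, n, seed=0):
    # Draw synthetic rows from an independent Gaussian per column.
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]

params = fit(real)
synthetic = sample(params, 100)  # 100 synthetic rows that mimic the real table's statistics
```

The synthetic rows preserve the aggregate shape of the original table without reproducing any individual record -- which is exactly why this approach appeals for privacy-sensitive domains, and also why a skewed or unrepresentative original sample silently skews everything generated from it.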
However, even with this measure taken and entropy accounted for, it is often difficult to ensure accuracy, according to Udezue. "Humans are unpredictable in unpredictable ways," he said. "How do you predict the variation for 8 billion people?"

Beyond the technical challenges, one of the biggest hurdles will be earning user trust when synthetic data is the primary source used to inform and create new solutions. Building that trust requires transparency around how synthetic data is generated, validated, and applied, with clear documentation such as model cards. "The trust aspect -- from the user perspective, we are utilizing these AI tools, but how do you feel getting into a self-driving car that wasn't tested on the road but was only tested using simulated data?" said Ekin.

Despite the challenges, the panel remained optimistic about using the technology in the future of AI and beyond. That doesn't mean the challenges aren't real or that no work remains, but the technology's overall potential to fuel growth across sectors is still great. "Simulated data, when correctly used, will elevate science, will elevate software, will elevate the industry, but we have to get the governance and transparency right, or we won't be able to take advantage of it properly," said Udezue.
[2]
Gen AI Needs Synthetic Data. We Need to Be Able to Trust It
Today's generative AI models, like those behind ChatGPT and Gemini, are trained on reams of real-world data, but even all the content on the internet is not enough to prepare a model for every possible situation. To continue to grow, these models need to be trained on simulated or synthetic data -- scenarios that are plausible but not real. AI developers need to do this responsibly, experts said on a panel at South by Southwest, or things could go haywire quickly.

The use of simulated data in training artificial intelligence models has gained new attention this year since the launch of DeepSeek AI, a new model produced in China that was trained using more synthetic data than other models, saving money and processing power. But experts say it's about more than saving on the collection and processing of data. Synthetic data -- computer-generated, often by AI itself -- can teach a model about scenarios that don't exist in the real-world information it's been given but that it could face in the future. That one-in-a-million possibility doesn't have to come as a surprise to an AI model if it's seen a simulation of it.

"With simulated data, you can get rid of the idea of edge cases, assuming you can trust it," said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists were speaking on Sunday at the SXSW conference in Austin, Texas. "We can build a product that works for 8 billion people, in theory, as long as we can trust it."

The hard part is ensuring you can trust it.

Simulated data has a lot of benefits. For one, it costs less to produce.
You can crash-test thousands of simulated cars in software, but to get the same results in real life, you have to actually smash cars -- which costs a lot of money, Udezue said. If you're training a self-driving car, for instance, you'd need to capture some less common scenarios that a vehicle might encounter on the road, even if they aren't in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He used the case of the bats that make spectacular emergences from Austin's Congress Avenue Bridge. They may not show up in training data, but a self-driving car will need some sense of how to respond to a swarm of bats.

The risks come from how a machine trained using synthetic data responds to real-world changes. It can't exist in an alternate reality, or it becomes less useful, or even dangerous, Ekin said. "How would you feel," he asked, "getting into a self-driving car that wasn't trained on the road, that was only trained on simulated data?" Any system using simulated data needs to "be grounded in the real world," he said, including feedback on how its simulated reasoning aligns with what's actually happening.

Udezue compared the problem to the creation of social media, which began as a way to expand communication worldwide, a goal it achieved. But social media has also been misused, he said, noting that "now despots use it to control people, and people use it to tell jokes at the same time." As AI tools grow in scale and popularity -- a scenario made easier by the use of synthetic training data -- the potential real-world impacts of untrustworthy training and of models becoming detached from reality grow more significant. "The burden is on us builders, scientists, to be double, triple sure that system is reliable," Udezue said. "It's not a fantasy."

One way to ensure models are trustworthy is to make their training transparent, so that users can choose which model to use based on their evaluation of that information.
The panelists repeatedly used the analogy of a nutrition label, which is easy for a user to understand. Some transparency already exists, such as the model cards available through the developer platform Hugging Face that break down the details of different systems. That information needs to be as clear and transparent as possible, said Mike Hollinger, director of product management for enterprise generative AI at chipmaker Nvidia. "Those types of things must be in place," he said. Hollinger said that ultimately it will be not just AI developers but also AI users who define the industry's best practices.

The industry also needs to keep ethics and risks in mind, Udezue said. "Synthetic data will make a lot of things easier to do," he said. "It will bring down the cost of building things. But some of those things will change society."

Udezue said observability, transparency and trust must be built into models to ensure their reliability. That includes updating training so that it reflects accurate data and doesn't magnify the errors in synthetic data. One concern is model collapse, in which an AI model trained on data produced by other AI models drifts increasingly far from reality, to the point of becoming useless. "The more you shy away from capturing the real world diversity, the responses may be unhealthy," Udezue said. The solution, he said, is error correction: "These don't feel like unsolvable problems if you combine the idea of trust, transparency and error correction into them."
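The model-collapse risk Udezue describes can be demonstrated with a toy simulation (not any panelist's code): treat a "model" as the empirical distribution of the tokens it was trained on, and train each generation only on the previous generation's synthetic output. Because a token that drops out of one generation can never reappear in the next, diversity can only shrink.

```python
import random

def next_generation(training_data, n):
    # "Train" on the data (memorize its empirical distribution),
    # then emit n synthetic tokens sampled from that distribution.
    return [random.choice(training_data) for _ in range(n)]

random.seed(0)
real_corpus = list(range(20))   # 20 distinct "tokens" of real-world data
generations = [real_corpus]
for _ in range(100):            # each generation trains only on the last one's output
    generations.append(next_generation(generations[-1], len(real_corpus)))

# Diversity never increases: once a token vanishes from a generation,
# no later generation can ever produce it again.
diversity = [len(set(g)) for g in generations]
```

Real model collapse is subtler than this, but the mechanism is the same, and it motivates the error correction Udezue calls for: periodically re-grounding training in fresh real-world data rather than only in prior model output.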
Experts discuss the potential and challenges of using synthetic data in AI development, highlighting its importance for advancing generative AI while emphasizing the need for trust, transparency, and real-world grounding.
Synthetic data, artificially generated information used to replace real data, is emerging as a crucial component in the development of generative AI models. As highlighted at a recent South by Southwest (SXSW) panel, this technology is becoming integral to training and refining machine learning and AI models, particularly in scenarios where collecting actual data is costly, time-consuming, or raises privacy concerns [1][2].
Synthetic data offers several benefits for AI development:

- It costs far less to produce than collecting and processing real-world data.
- It can stand in for sensitive information, such as financial records, where using real data raises privacy concerns.
- It can cover rare but plausible edge cases that are absent from real-world training data.
Mike Hollinger, director of product management at NVIDIA, noted that most current large language models likely incorporate synthetic data in their training process [1].
Despite its potential, synthetic data poses several challenges:

- Accuracy is hard to guarantee, since small deviations from real-world patterns can undermine a model's reliability.
- Models trained heavily on AI-generated data risk model collapse, drifting further from reality with each generation.
- Users may not trust systems, such as self-driving cars, that were tested only on simulated data.
To address these challenges, experts emphasize the need for:

- Transparency about how synthetic data is generated, validated, and applied, for example through model cards.
- Grounding synthetic datasets in real-world data and feedback.
- Error correction and observability built into models so that errors in synthetic data are not magnified.
Despite the challenges, experts remain optimistic about the potential of synthetic data in advancing AI technology. Oji Udezue, a product management expert, stated, "Simulated data, when correctly used, will elevate science, will elevate software, will elevate the industry, but we have to get the governance and transparency right" [1].
As the AI industry continues to evolve, the responsible use of synthetic data will likely play a crucial role in shaping the future of generative AI and its applications across various sectors.
© 2025 TheOutpost.AI All rights reserved