Data is pretty much everything when it comes to training AI systems, but accessing enough data to produce quality products that live up to their promise is a major challenge, even for companies with the deepest of pockets.
This is a problem that Advex AI is setting out to address, using generative AI and synthetic data to "solve the data problem," as the company puts it. More specifically, Advex allows customers to train their computer vision systems using a small sample of imagery, with Advex generating thousands of "fake" pictures from that sample.
Today marks Advex's formal launch on the Startup Battlefield stage at TechCrunch Disrupt 2024, though it already secured customers during its stealth phase, including what it calls "seven major" enterprise clients, which it says it's not at liberty to disclose. TechCrunch can also reveal that the San Francisco-based startup has raised $3.6 million in funding, the bulk of which came via a $3.1 million seed tranche last December, with notable backers including Construct Capital, Pear VC, and Laurene Powell Jobs' Emerson Collective.
CEO Pedro Pachuca started Advex with his co-founder and CTO Qasim Wani a little over a year ago, and the company has a headcount of just six. That such a svelte startup has already landed real paying customers is notable, something Pachuca attributes at least partly to his background, as well as good old-fashioned networking and cold outreach. Indeed, Pachuca was previously a machine learning researcher at Berkeley, and later joined the research team at Google Brain before it merged into DeepMind.
"If the ROI [return on investment] makes sense, they'll [customers] trust us a bit," Pachuca said. "I have done a lot of research in this space -- being at Google Brain before gives me a little bit of credibility. But at the beginning it was cold emails, and that got us our first two big customers. Then it was conferences -- that's why I go to so many of them!"
Shortly after concluding his interview with TechCrunch, Pachuca was set to head to Europe, where he planned to attend various meetings and conferences, including the European Conference on Computer Vision (ECCV) in Milan, Italy, and Vision in Stuttgart, Germany.
"There's a lot of conferences out there in Europe," Pachuca said. "We're going to ECCV to learn and hire, basically," Pachuca added. "And Vision is more on the industrial side, so we're there to sell."
Potential customers include legacy developers of machine vision systems, along the lines of Cognex or Keyence, which are striving to bolster their products with better AI. Alternatively, Advex might sell directly to end-user businesses, such as car manufacturers or logistics companies building their own in-house tooling.
For example, a car manufacturer might need to teach its computer vision system to recognize defects in the material of its car seats. However, even if the company could access hundreds of distinct images, no two defects look quite the same. So instead, the manufacturer can upload a dozen pictures of seats with tears in them, and Advex extrapolates from that sample to generate thousands of "defective" seat pictures, building a far more extensive and diverse pool of training data.
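To make that concrete, the snippet below sketches what such a workflow could look like in ordinary PyTorch: a small folder of real defect photos is pooled with a much larger folder of generated ones, and a stock image classifier is fine-tuned on the mix. It's an illustration only, not Advex's pipeline, and the file paths, class labels, and hyperparameters are assumptions.

```python
# Minimal sketch of the workflow described above, not Advex's actual pipeline:
# a handful of real defect photos is padded out with synthetic images, and a
# small off-the-shelf classifier is fine-tuned on the combined set.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ~a dozen real photos of torn seats, plus thousands of generated variants,
# both arranged in ImageFolder layout: <root>/<class_name>/<image>.png
real = datasets.ImageFolder("data/real_seats", transform=tfm)
synthetic = datasets.ImageFolder("data/synthetic_seats", transform=tfm)
train_loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=32, shuffle=True)

# Off-the-shelf backbone, re-headed for a binary "defect / no defect" task.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```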
The same can be applied to just about any manufacturing sector, from oil and gas to wood furnishings -- it's all about reducing data collection time and costs by artificially creating training imagery.
Synthetic data isn't a new concept, of course, but with the AI revolution in full swing, businesses are seeking to bridge their data gaps. This spans areas such as market research, where survey samples may be too small, as well as computer vision, as we're seeing with the likes of Advex and other VC-backed startups such as Synthesis AI and Parallel Domain.
Broadly speaking, there are two kinds of models that Advex deals with. The model deployed at the customer's site, the one trained on the customer's own images, is just standard off-the-shelf "open source stuff," as Pachuca puts it. "That's because they need to be small, and we also don't believe that the gains come from the architecture of the model -- they come from training on the right data," he said.
But the real secret sauce is the company's proprietary diffusion model, similar in spirit to something like Midjourney or DALL-E, which is what's used to create the synthetic data. "That one is custom, and is highly complicated -- that's where we put all of our effort," Pachuca added.
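Advex doesn't disclose how its generator is built, but an off-the-shelf image-to-image diffusion pipeline gives a rough feel for the idea: feed in one real photo of a torn seat and sample many plausible variations of it. The model name, prompt, and settings below are illustrative assumptions, not Advex's.

```python
# Rough analogy for diffusion-based synthetic data, using a public pipeline
# rather than Advex's custom model: one real defect photo seeds many variants.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

seed_image = Image.open("data/real_seats/defect/tear_01.png").convert("RGB")

# Lower strength keeps outputs close to the seed photo; higher strength
# explores more varied (and riskier) defect appearances.
for i in range(1000):
    out = pipe(
        prompt="close-up photo of a torn leather car seat, factory lighting",
        image=seed_image,
        strength=0.4,
        guidance_scale=7.5,
    ).images[0]
    out.save(f"data/synthetic_seats/defect/variant_{i:04d}.png")
```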
While Advex's manufacturing focus is one way it differentiates, it's really the diffusion model approach that the company sees as setting it apart.
Compared to other simulation and modeling techniques, such as those built on game/physics engines (e.g., Unity), Pachuca says that using diffusion requires no setup, generation takes just seconds per image/label pair, and the output is far closer to real-life data.
"We're not just creating any images, we're creating the images you don't have -- specifically trying to understand what is missing, and creating that," Pachuca said. "And this 'what is missing' part is really hard, and it's very invisible, but it's one of the biggest innovations that we've made."