The hidden reason AI costs are soaring -- and it's not because Nvidia chips are more expensive
But companies training AI models, or fine-tuning existing models to improve performance on specific tasks, also struggle with another often overlooked and rising cost: data labeling. This is a painstaking process in which the data used to train generative AI models is affixed with tags so that a model can recognize and interpret patterns.
Data labeling has long been used to develop AI models for self-driving cars, for example. A camera captures images of pedestrians, street signs, cars, and traffic lights, and human annotators label the images with words like "pedestrian," "truck," or "stop sign." The labor-intensive process has also raised ethics concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized for outsourcing, to Kenyan workers earning less than $2 an hour, the data labeling that helped make the chatbot less toxic.
Today's general-purpose large language models (LLMs) go through an exercise related to data labeling called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback on, or rankings of, what the model produces. That is one significant source of rising costs, as is the effort involved in labeling private data that companies want to incorporate into their AI models, such as customer information or internal corporate data.
In addition, labeling highly technical, expert-level data in fields like law, finance, and healthcare is driving up expenses. That's because some companies are hiring high-cost doctors, lawyers, PhDs, and scientists to label certain data, or outsourcing the work to third-party companies such as Scale AI, which recently secured a jaw-dropping $1 billion in funding as its CEO predicted strong revenue growth by year-end.
"You now need a lawyer to label stuff, [which is] a crazy use of legal hours," said William Falcon, CEO of AI development platform Lightning AI. "Anything high stakes" requires expert-level labeling, he explained. "A chat with a 'virtual BFF is not high stakes, providing legal advice is."
Alex Ratner, CEO of data labeling startup Snorkel AI, says corporate customers can spend millions of dollars on data labeling and other data tasks, which can eat up 80% of their time and AI budget. Over time, data must also be relabeled to remain up to date, he added.
Matt Shumer, CEO and cofounder of AI assistant startup Otherside AI, agreed that fine-tuning LLMs has gotten expensive. "Over the past couple of years, we've gone from middle-school-level data being okay to needing high school, college, and now expert," he said. "That obviously doesn't come cheap."
That can create budget woes for tech startups building in important areas like healthcare. Neal Shah, CEO of CareYaYa, a platform for elder caregivers, says his company received a grant from Johns Hopkins University to build "the world's first AI caregiver trainer for dementia patients," but that data labeling costs are "eating us alive." The cost, he said, has skyrocketed 40% over the past year because of the specialized information needed from gerontologists, dementia experts, and veteran caregivers. He's working to reduce those costs by enlisting healthcare students and college professors to do the labeling.
Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, said he has seen data labeling projects that cost millions. Platforms like BeeKeeper AI, he said, can help lower costs by letting multiple companies share experts, data, and algorithms without exposing their private data to the others.
Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are lowering costs by using "synthetic" data -- or data generated by the AI itself -- to at least partially automate data collection and labeling. In some cases, models can fully automate data labeling. For example, biopharma companies are training generative AI models to develop synthetic proteins for conditions like colorectal cancer, diabetes, and heart disease. The companies automatically run experiments based on the outputs of the generative AI models, which provide new training data that comes with labels.
The bottom line, however, is that data labeling may be costly and time-intensive, but it's well worth it. "Data labeling's a beast," said CareYaYa's Shah. "But the potential payoff is massive."
DeepMind military protest. Nearly 200 DeepMind staffers want Google's AI unit to stop working with the military, Time reports. A letter to management reportedly says Google's Cloud business is breaking the company's rules by selling AI to militaries that are at war -- no names are named, but there are links to reports on Google's dealings with the Israeli military and (allegedly) Israeli weapons firms. Google claims only Israeli government ministries are using its cloud services, with no "military workloads relevant to weapons or intelligence services."
China's Amazon route. Reuters reports that state-linked entities in China have been using Amazon's cloud services to access the kind of advanced chips and AI that U.S. export controls aim to hold back from China. The U.S. rules ban exports and transfers of advanced chips and AI software to Chinese entities, but access through the cloud is allowed. Amazon Web Services says it's not breaking any rules.
Cruise + Uber. GM's Cruise robo-taxi unit, which is trying to get back on track after serious setbacks, has struck a deal with Uber to offer self-driving services in an unspecified U.S. city, the Financial Times reports. Uber already has a similar arrangement with Alphabet's Waymo for robo-taxi services in Phoenix. That said, Cruise isn't offering autonomous rides right now -- it's still testing its cars with human drivers after a long pause that followed an incident in San Francisco in which a pedestrian was dragged underneath one of its cars.
"You can find a few interesting use cases, but broadly, it seems like there's a lot of caution around this...Particularly around bigger companies that have complex permissions around their SharePoint or their Office 365 or things like that, where the Copilots are basically aggressively summarizing information that maybe people technically have access to but shouldn't have access to."
-- Securiti chief data officer Jack Berkowitz tells The Register that half the peers he's polled have paused their rollouts of Microsoft's Copilot, an AI assistant that he alleges is accessing internal corporate data that it shouldn't.
AI makes self-driving cars possible. So why is the industry keeping its distance?, by Sage Lazzaro
Alibaba is upgrading its Hong Kong listing to primary, and that could potentially unlock billions in new investment, by Lionel Lim
The stranded Boeing Starliner astronauts planned to hitch a ride home with SpaceX, but their spacesuits aren't compatible with Elon Musk's spacecraft, by Marco Quiroz-Gutierrez
A California woman outsmarted two alleged mail thieves by sending herself an AirTag, by the Associated Press
I sold a $1.4B big-data startup to IBM -- then founded a nature sanctuary. Here are the dangers of AI energy consumption, by Chris Gladwin (Commentary)
Jelly Pong. Scientists have managed to make a "soft and squidgy water-rich gel" learn how to play the vintage video game Pong, the Guardian reports. What's more, the hydrogel actually gets better at the game over time because it has memory, though it is not sentient, the U.K. researchers said. However, the jelly-like material isn't as good a Pong player as another system that was shown off a couple of years ago, based on a bunch of neurons in a dish -- satisfyingly, that system was named DishBrain.