2 Sources
[1]
As better chatbots get harder to build, AI turns to simulated worlds
In the New York City office of a startup, an artificial intelligence (AI) program dreams up a world on the fly. As I move through a video game -- navigating rooms, confronting other characters -- each frame is generated in real time in a New Jersey data center. But this challenge wasn't designed for humans. I pause and an autonomous AI agent takes over for me. "An AI playing in the mind of another AI," says General Intuition co-founder Adam Jelley.
General Intuition is betting this kind of system -- one in which AI agents learn by acting within simulated worlds -- will eventually outsmart large language models (LLMs), the dominant AI models behind chatbots such as ChatGPT, Claude, Gemini, and Grok.
For the past decade or so, industry has taken for granted that bigger LLMs are better. That was the conclusion of a heavily cited 2020 paper, which found performance improves as long as you boost a model's size, training data, or the computing power used to train it. "We were very surprised" by the trends' consistency, says study leader Jared Kaplan, a theoretical physicist at Johns Hopkins University. These "scaling laws" appeared to hold up for several orders of magnitude. The lesson to the AI industry was clear: Grow at (nearly) all costs.
Since the publication of the paper, companies have done just that. They're spending hundreds of billions of dollars a year to build AI models with hundreds of billions or even trillions of adjustable parameters, trained on trillions of words and images scraped from the internet. OpenAI's latest GPT models are estimated to be several times the size of its largest 2020 model and trained on an order of magnitude more data. More recently, companies have added another scaling lever: test-time compute. By getting models to think longer before answering, performance can be improved without necessarily training bigger models on more data.
The results are impressive: LLMs can pass lawyers' bar exams and doctors' medical licensing exams. They can match top high school students in the International Mathematical Olympiad. They can write poetry that many readers find more beautiful than human works, and top programmers now use them to write most of their code.
Kaplan, who went on to co-found Anthropic and is now its chief science officer, is among the many who think LLMs still have plenty of room to run. Empirically, model performance has followed the scaling laws he sketched out in 2020. "You can view it as a self-fulfilling prophecy," Kaplan says. For the foreseeable future, "If you're not seeing clean scaling laws, then you're doing something wrong."
But some researchers argue LLM scaling is running into practical limits. Kaplan's paper described a power law: Each gain requires disproportionately more resources. Meanwhile companies are running out of training data to scrape. One 2024 study estimated they'll exhaust high-quality public text data in the next few years. As for computational power, chips and algorithms are gaining efficiency, but not quickly enough. Data centers under development will each draw gigawatts of power, straining the grid.
Other concerns are more fundamental. LLMs, even variants trained to "reason" or process images alongside text, are built primarily to predict the next token -- a fragment of text or image data -- in a sequence. They first learn statistical patterns from their enormous data sets, then are fine-tuned to produce more useful or reliable answers.
But LLMs do not experience the world they describe. They have no way to test a hypothesis or probe an environment. They learn patterns of cause and effect indirectly. That can be enough to generate fluent explanations or plausible plans. But it can fall short when success depends on understanding the consequences of actions in the real world. When asked how to stack common objects, for example, the models sometimes stumble, showing a lack of common sense. Those shortcomings are consequential in the real world. You wouldn't trust an LLM to act as a child therapist or police officer.
LLMs are "getting better and better, but you cannot just throw more data at it and expect it to magically improve," says Jane Wang, a research scientist at Google DeepMind. "I do think there is a lot left in textual intelligence," says Jakub Pachocki, OpenAI's chief scientist. "But it's quite clear that humans don't reason only in words."
Many researchers are now convinced that humanlike AI, or artificial general intelligence (AGI), will require more than mastering language and images. It will require AIs that can reason about space, causality, and the consequences of actions -- especially if they are to control humanoid robots, operate factories, and explore other planets.
Few people have argued for this need more forcefully than AI pioneer Yann LeCun. "I joke that the smartest systems we have today are not as smart as a house cat," he says. A cat can't code like an LLM, but it can survive by its wits. The notion that simply scaling an LLM will get to AGI is "complete nonsense," he says. "It's like saying you're going to get into orbit by scaling airplanes. There's a very powerful delusion circulating in Silicon Valley that this is the case."
LeCun left a top job at Meta to co-found one of a growing number of labs and startups developing "world models" -- systems that build representations of how the world works -- and agents that operate within them to learn or plan. Ultimately, these researchers hope that more closely mimicking how the human mind learns will give AI stunning new powers.
The gaps between humans and LLMs are not merely quantitative. In a 2024 study, LLMs that were trained on sequences of directions from New York City taxi rides could generate new routes reliably, suggesting they had turned those directions into an accurate map of the city. But when researchers looked under the hood to examine their internal representations, they found not a clean city grid, but an incoherent mess of tangled streets.
LLMs "are so alien and so unhumanlike," says Brenden Lake, a cognitive scientist at Princeton University. Lake has trained AIs on hundreds of hours of headcam videos from toddlers to see what they could learn from the input. Children acquire language from remarkably little data compared with LLMs: millions of words rather than trillions. More important, children spend a year or two exploring before they learn language -- touching objects, navigating spaces, and observing how the world responds, whereas LLMs start with language. AI, Lake says, "has development backwards."
He thinks that difference matters. Spurred by innate curiosity and a desire to experiment, humans can flexibly combine simple concepts and apply them in new situations: A child who learns to skip, for example, can skip to the door and back, or skip while singing. LLMs can also improvise, he says, "but it's this frustrating mix of intelligence and bizarre failures." LLMs are limited because they are disembodied, he says: They need to learn like children. "To understand a word in the way that a human does, AI needs to be grounded in objects in the real world."
Taking Lake's advice to heart, some labs and AI companies are now trying to build general AI systems that learn less like chatbots and more like embodied agents -- through experience, interaction, and experimentation. Researchers are pursuing that goal in two main ways. Some systems have agents learn by trial and error inside simulated worlds before taking their lessons to the real world. Others build predictive worlds that agents carry with them, so they can internally test their actions ahead of time. You might think of the two approaches as offline and online world models. They enable reflexive and deliberative action, respectively.
Pursuing the first approach, researchers at Google DeepMind have developed Genie, a series of world models that can generate navigable 3D environments from prompts or videos, almost like creating a video game world on demand. The environments serve as a training and testing ground for agents such as the Scalable Instructable Multiworld Agent (SIMA), which is based on Google's Gemini model. Training in a range of realms enables SIMA to carry out instructions in unfamiliar worlds -- for example, exploring a scene, identifying strange objects, and attempting to guess what they're made of. Such agents would find purpose beyond gaming, says Shlomi Fruchter, a research director at Google DeepMind -- for example, as an AI scientist that could operate in the changing environment of a scientific lab. If an AI agent can't perform tasks physically, "then it's very limiting in terms of the impact."
Nvidia, the AI chipmaker, is already pushing that idea into the physical world. The company trains agents inside simulated environments before deploying them in robots that could operate in warehouses and factories, where success depends not just on abstract reasoning, but on coordinated movement in unpredictable environments. Its GR00T agents take in camera feeds and language instructions, then generate actions for a robot to carry out. In virtual demonstrations, GR00T-guided robots can place a potato in a microwave and shut the door -- a trivial task for a human, but one that requires coordination and planning. One of Nvidia's world models, DreamZero, attempts to predict how the world will evolve after an action is taken, helping robots adapt to unfamiliar environments and tasks, says Yuke Zhu, a computer scientist at the University of Texas at Austin who helped develop the Nvidia systems.
Yet while LLMs can solve math Olympiad problems and write software, world-model agents are still struggling to grasp coffee cups and navigate toy worlds. Part of the problem is that the physical world is vastly more complex than language. According to General Intuition co-founder Pim de Witte, text compresses the four dimensions of reality into a single dimension. Real environments are noisy, continuous, and dynamic. Robots must contend with changing lighting, shifting objects, and the consequences of their own actions.
Another problem, de Witte notes, is that world models need data that reflect this complexity: examples of actions and outcomes, whether it's robots manipulating objects or humans navigating spaces in video games. Those examples are few and expensive compared with text. Researchers are leveraging a bounty of passive data in YouTube videos, but watching is not the same as doing; they still need more data from embodied interactions. In the race toward AGI, cheap but thin text data may give LLMs an advantage.
LeCun is betting on richer, scarcer world data. He was the chief AI scientist at Meta, but when the company began to move away from robotics and toward LLMs he grew restless. LeCun says he realized that "I could basically snap my fingers and raise a billion." So that's what he did. In March, with a little over $1 billion in funding, he unveiled Advanced Machine Intelligence (AMI) Labs, which has adopted the online approach to exploiting world models. Its systems try to understand the physical world predictively, by carrying a world model around with them.
AMI's core technology is a family of algorithms called the Joint-Embedding Predictive Architecture (JEPA). Unlike many world models, which predict every pixel of a video frame, a JEPA model tries to predict abstract representations of future states. When driving, for example, you care whether a traffic light is red or green, not the exact appearance of the light. LeCun argues intelligent systems need to reason about these high-level properties, filtering out irrelevant detail.
To give agents the ability to imagine future outcomes, researchers first trained JEPA models on thousands of hours of YouTube videos, allowing them to learn how scenes evolve over time. They then adapted the models to robotics, teaching them to predict how objects and robot arms respond to particular actions. The goal is not simply reflexive behavior, but planning. Given a desired future state -- say, moving a robot arm to grasp a coffee cup -- an agent using JEPA can simulate possible action sequences within its internal world before choosing one.
The task is beyond LLMs. "You can't model the behavior by predicting discrete tokens," LeCun says. "So you're not going to be able to use LLMs for any of that." LeCun envisions applying JEPA systems first to robotic controls in industries such as power plants, aerospace, and medicine. But, he says, "The ultimate goal is to build universal intelligence systems."
At Anthropic, researchers are largely avoiding any explicit attempts to develop world models. Kaplan says that's partly a business decision to focus on AI systems useful for coding, writing, and office work rather than robotics. But it also reflects a deeper conviction: that today's LLMs are still far from exhausting their potential.
As LLMs scale, Kaplan argues, they often develop unexpected new capabilities -- a phenomenon called "emergence." Over the past several years, LLMs have acquired abilities that smaller versions lacked: performing arithmetic, carrying out multistep reasoning, and writing functional software. Such skills were not explicitly programmed, and in many cases researchers did not predict when they would appear. To Kaplan, those surprises suggest intelligence may emerge from sufficiently large LLMs without requiring fundamental new architectures.
He rejects the idea that LLMs are merely statistical parrots disconnected from reality. LLMs, he argues, already contain internal representations of the world, built indirectly from patterns in text and images. Otherwise, they wouldn't be able to provide driving directions at all. "Some people have suggested that you can't train AGI without embodiment," he says, "and I'm personally very skeptical of that."
De Witte concedes that sufficiently large LLMs may develop implicit world models. "The question is," he says, "at what cost?" At the company's New York City office, Jelley mimics a wiping motion across a table with his hand. A chatbot can describe how to clean a table, he says. But knowing how much pressure to apply, or how to adjust movement as crumbs scatter, is much harder to learn from words alone. "It's kind of like the picture-is-worth-a-thousand-words view of why you need world models."
Whether that approach has a better chance of reaching humanlike intelligence than LLM scaling is unresolved -- and progress on the two fronts may not be mutually exclusive. World models and language models can be combined, de Witte says. An LLM could call on a world model for spatial tasks, or a world model could call on an LLM for language tasks. In LeCun's experiments, for example, a JEPA model paired with an LLM was able to answer questions about what people in videos were likely to do next. More interestingly, de Witte says, world models might just acquire the ability to model language as part of the world, by perceiving writing and speech through visual and audio feeds -- the way humans do when reading, writing, and conversing.
On my way out of General Intuition, Jelley handed me a sticker bearing the company's logo, which resembles an upside-down A. It looks like the mathematical symbol meaning "for all," he said. And it hints at the overarching goal. "If you squint, you can get a 'G-I' out of it as well."
[2]
Top developers are shifting from chatbots to physical AI. Here's why
Computer scientist Louis Castricato was in his eighth year studying large language models -- the artificial intelligence technology behind chatbots like ChatGPT and Claude -- when he started to feel like he was hitting a dead end. "We basically have passed the point of doing real fundamental LLM research," Castricato said. "Now it's just applications."
The researcher quit his doctoral studies at Brown University and started a new company, called Overworld. Its ambition is in its name: AI that can understand and navigate a world, not just words.
There's still plenty of money to be made from AI chatbots -- investors are counting on it as they commit trillions of dollars to leading developers like Anthropic and OpenAI. But a growing number of AI entrepreneurs are dedicating themselves to what they see as the next frontier: "world models" that teach AI systems, and sometimes robots, how to react in a physical environment. They include some of the field's most prominent scientists, such as "Godmother of AI" Fei-Fei Li, who describes the concept of a world model as "one of the most important and most overloaded terms in AI today."
Scientists are applying AI in new dimensions with 'world models'
At the heart of world model research is the idea that AI can't be truly intelligent if it can only read a book. It also needs to read the room.
"Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics," wrote Li, founder of the San Francisco startup World Labs, in an essay published this month.
Another proponent is AI pioneer Yann LeCun, who quit his job as Meta's chief AI scientist last year to start Paris-based Advanced Machine Intelligence Labs. "World model is quickly becoming a buzzword," LeCun said on a recent "Unsupervised Learning" podcast. He said he views it as something that enables an AI agent "to predict the consequences of its own actions."
There are multiple ways of defining world models, often based on the technologies someone hopes to build with them -- be it robots or a more interactive video game.
Robots can't learn much from AI models trained on books
Training on all of humanity's books, news articles, and visual media, as AI language models have done, has led to AI assistants that are changing the nature of office-based work and some creative fields. But some proponents see limitations in generative AI models that work by repeatedly predicting the next word or pixel to produce new dialogue, images, or lines of code.
Chatbots can't pick up a coffee mug, notes Martial Hebert, dean of computer science at Carnegie Mellon University. "There's all the geometry of the world, the dynamic of how I move my hand, the physical interaction of the contact with the cup," Hebert said. "This is much more complex than just predicting the next word in a sentence."
For scientists like Hebert, who has spent more than four decades researching robotics, the most useful application for world models is as a faster and cheaper path to "physical AI" -- another tech industry buzzword. "Some people may have different definitions, but physical and embodied AI are kind of the evolution of what we used to call robotics," Hebert said in an interview.
Some of the AI advances that have made chatbots so useful can also be applied to building AI with a broad enough awareness of its environment to work like a robot's brain, he said. "In your body and spinal cord you have a very general model of how to balance, how to walk around, and you can adapt to your knee hurting in the morning, so you now walk a little differently," he said. "You don't need to think about that. You have a general model somewhere in your nervous system and brain that allows your body to adapt very quickly."
Simulated worlds are drawing interest from investors
Smarter robots aren't the only end game for world models. Castricato started Overworld last year and the tiny Rhode Island-based startup is now building video game worlds where a scene, say, of a spooky forest, can adapt as a virtual character moves through it and interacts with the objects in it.
"There's no other world model where you can just walk through doors or where you can interact with a detailed environment like this," he said in an interview. "We optimize for interaction above anything else."
While the near-term applications aren't as readily apparent as AI coding tools, world model makers are attracting interest from venture capitalists like Steve Jang, co-founder and managing partner at Kindred Ventures. The firm is investing in Overworld and other world model-focused companies, including Causal Labs, which is building AI models for weather prediction, and Extropic, which is building specialized computer chips suited to world models.
"I think that the future is many different types of models with many different philosophies and architectures," Jang said. "I don't think that it'll be one large, dense model to rule them all."
In her recent essay, Li sought to create a "taxonomy of world models" to help sort out the confusion about the competing visions. "A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name," she wrote.
She divided world models into three categories. The most commercially viable today are "renderers" that prioritize the visual fidelity of the virtual worlds they create but can't be trusted to teach robots much. Then, there are "simulators" that create virtual training grounds that faithfully represent the physical structure of a world; and "planners" that try to predict what an AI agent or robot should do in an unstructured world.
"A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first," she wrote.
Leading AI researchers are pivoting from chatbots to world models that teach AI agents how to navigate physical environments. Pioneers like Yann LeCun and Fei-Fei Li argue that true intelligence requires more than text prediction—it demands spatial and causal understanding of how actions produce consequences in real-world settings.
A fundamental shift is underway in artificial intelligence development. At General Intuition's New York office, AI agents learn by navigating video game environments generated in real time by another AI system. Co-founder Adam Jelley describes it as "an AI playing in the mind of another AI." The startup is betting that AI agents trained within simulated worlds will eventually outsmart large language models, the technology powering ChatGPT, Claude, and other chatbots [1].
Source: Fast Company
This approach marks a departure from the industry's decade-long assumption that bigger models always perform better. A heavily cited 2020 paper established scaling laws showing performance improves with model size, training data, and computational power. Study leader Jared Kaplan, now Anthropic's chief science officer, says empirically these laws still hold: "If you're not seeing clean scaling laws, then you're doing something wrong" [1].
Yet researchers increasingly recognize practical constraints. Companies are spending hundreds of billions of dollars annually to build models with hundreds of billions or even trillions of parameters, but they face mounting challenges. A 2024 study estimated high-quality public text data will be exhausted within the next few years. Data centers under development will each draw gigawatts of power, straining electrical grids. The scaling laws describe a power law where each gain requires disproportionately more resources [1].
More fundamentally, LLMs built primarily to predict the next token lack experiential understanding. They cannot test hypotheses or probe environments. When asked how to stack common objects, models sometimes stumble, revealing gaps in common sense. "I do think there is a lot left in textual intelligence," says OpenAI chief scientist Jakub Pachocki, "but it's quite clear that humans don't reason only in words" [1].
Google DeepMind research scientist Jane Wang puts it bluntly: "You cannot just throw more data at it and expect it to magically improve" [1].
Computer scientist Louis Castricato felt he'd hit a dead end in his eighth year studying LLMs. "We basically have passed the point of doing real fundamental LLM research," he said. "Now it's just applications." He quit his Brown University doctoral program to launch Overworld, a startup building AI that can understand and navigate worlds, not just words [2].
World models represent what many see as AI's next frontier. These systems teach AI agents, and sometimes robots, how to react in a physical environment. "Godmother of AI" Fei-Fei Li, founder of World Labs, calls world models "one of the most important and most overloaded terms in AI today." She explains: "Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics" [2].
Yann LeCun, who left Meta's chief AI scientist role to start Paris-based Advanced Machine Intelligence Labs, views a world model as something that enables an AI agent "to predict the consequences of its own actions." He jokes that today's smartest systems "are not as smart as a house cat." A cat cannot code like an LLM, but it can survive by its wits [1][2].
Carnegie Mellon's computer science dean Martial Hebert, who has spent more than four decades researching robotics, notes chatbots cannot pick up a coffee mug. "There's all the geometry of the world, the dynamic of how I move my hand, the physical interaction of the contact with the cup," he explains. "This is much more complex than just predicting the next word in a sentence." Hebert describes physical and embodied AI as the evolution of what used to be called robotics, applying AI advances from chatbots to build systems with environmental awareness [2].
Overworld is building video game worlds where environments adapt as virtual characters move through them. "We optimize for interaction above anything else," Castricato says. The Rhode Island startup has attracted venture capital from Kindred Ventures, which is also investing in Causal Labs for weather prediction and Extropic for specialized chips suited to world models. Managing partner Steve Jang sees world models as a promising frontier despite less obvious near-term applications than AI coding tools [2].
Many researchers now believe humanlike artificial general intelligence will require more than mastering language and images. It demands AI systems that reason about space, causality, and the consequences of actions, especially for controlling humanoid robots, operating factories, and exploring other planets. The question facing the industry is whether AI for physical environments can deliver on this promise while overcoming the data scarcity and computational power constraints that challenge traditional scaling approaches.
Summarized by Navi