It all starts with data, and increasingly with 'synthetic data'. In a recent Reddit discussion among AI enthusiasts, remarks by OpenAI founding member Andrej Karpathy highlighting a critical limitation of LLMs, namely their lack of "thought process kind of data", sparked a debate. The exchange underscored the growing importance of training AI models on synthetic data that simulates human-like reasoning.
Karpathy's observation brings to light a significant gap in the current AI landscape and a crucial challenge on the road to AGI. While LLMs are excellent at generating text based on patterns, they tend to imitate rather than reason, producing statistically likely responses rather than engaging in structured, deliberate reasoning. This limitation is rooted in the nature of their training data, which consists mostly of diverse, unstructured text from the internet.
While LLMs like GPT-4 have demonstrated impressive language capabilities, they still fall short when it comes to reasoning through complex problems in a manner that feels genuinely 'thoughtful'.
Anthropic CEO Dario Amodei, during a discussion with podcaster Lex Fridman, complemented this view, shedding light on how reinforcement learning with human feedback (RLHF) is used to make models more aligned with human preferences. "RLHF bridges the gap between the human and the model... It's not just about making the model appear smarter but enabling it to better communicate with humans," Amodei said.
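To make that mechanism a little more concrete, the sketch below is a minimal, hypothetical illustration of the preference step at the heart of RLHF: candidate responses are scored, and the preferred one becomes the training signal that nudges the model towards human-aligned behaviour. The `toy_reward_model` function, the prompt and the candidates are all placeholders for illustration; real pipelines use learned reward models and optimisers such as PPO or DPO, not a hand-written rule.

```python
# Minimal, hypothetical sketch of the preference step behind RLHF.
# A real pipeline uses a reward model learned from human comparisons
# and an RL-style optimiser; a toy scoring function stands in here.

def toy_reward_model(prompt: str, response: str) -> float:
    """Placeholder reward: favours responses that stay on topic and
    offer a justification (real reward models are learned, not coded)."""
    score = 0.0
    if any(word in response.lower() for word in prompt.lower().split()):
        score += 1.0                     # stays on topic
    if "because" in response.lower():
        score += 1.0                     # explains its reasoning
    return score

prompt = "Why does ice float on water?"
candidates = [
    "Ice floats.",
    "Ice floats because freezing water expands, making it less dense than liquid water.",
]

# The highest-scoring candidate is the 'preferred' example that
# fine-tuning would push the model towards.
preferred = max(candidates, key=lambda r: toy_reward_model(prompt, r))
print(preferred)
```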
Without sufficient 'thought process' data, these models often struggle with multi-step reasoning, complex decision-making, or maintaining coherence over long dialogues. Karpathy believes this gap is a major roadblock to achieving AGI, as true intelligence requires not just data but the ability to process and utilise it in a structured, purposeful manner.
To address this limitation, synthetic data has emerged as a promising solution. OpenAI, for example, is reportedly using an internal tool named 'Strawberry' to generate synthetic datasets structured to simulate this missing 'thought process' data.
The idea is to drive iterative, recursive improvements: by training on progressively refined synthetic data, LLMs could learn to simulate structured thought, ultimately yielding more robust and thoughtful models. In this way, synthetic data is believed to have the potential to close the gap between current LLM abilities and the capabilities required for AGI.
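As an illustration only, the sketch below shows one common recipe for producing 'thought process' style synthetic data: sample step-by-step reasoning traces from a model, keep only the traces whose final answer can be verified, and reuse the survivors as training examples. The `generate` callable is a stand-in for any LLM API, and nothing here reflects OpenAI's actual 'Strawberry' tooling.

```python
# Hypothetical sketch: generating and filtering synthetic reasoning traces.
# `generate` stands in for a call to any LLM; it is not a real API.

from typing import Callable

def make_reasoning_examples(
    problems: list[dict],                 # each: {"question": str, "answer": str}
    generate: Callable[[str], str],       # model call returning a reasoning trace
    samples_per_problem: int = 4,
) -> list[dict]:
    """Keep only traces whose final line matches the known answer,
    so the synthetic data rewards correct step-by-step reasoning."""
    dataset = []
    for item in problems:
        prompt = (
            "Solve step by step, ending with 'Answer: <value>'.\n"
            + item["question"]
        )
        for _ in range(samples_per_problem):
            trace = generate(prompt)
            lines = trace.strip().splitlines()
            if lines and lines[-1].endswith(item["answer"]):
                dataset.append({"prompt": item["question"], "completion": trace})
    return dataset
```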
The growing interest in synthetic data is not confined to OpenAI. Other leaders in the AI industry are also exploring its potential.
Anthropic, for instance, is working on ways to generate 'infinite' training data that can bypass some of the limitations of traditional web scraping. According to Amodei, synthetic data offers a way to sidestep the biases and inconsistencies present in real-world data while also scaling AI models more effectively.
Amanda Askell, philosopher and member of technical staff at Anthropic, elaborated on the characteristics of AGI during a conversation with Fridman. She speculated that interacting with AGI would resemble engaging with "an extremely capable human", offering genuinely novel insights. Askell envisioned scenarios where AI could solve problems requiring months of human effort, such as developing a novel proof in mathematics.
"If I just took something like that where I know a lot about an area and I came up with a novel issue or a novel solution to a problem, and I gave it to a model, and it came up with that solution, that would be a pretty moving moment for me," Askell further explained. Such capabilities demand data that goes beyond the superficial replication of existing knowledge.
Meanwhile, Hugging Face has introduced Cosmopedia, a vast synthetic dataset aimed at enriching the machine-learning community with high-quality, curated data. By using synthetic datasets, Hugging Face hopes to equip models with a better understanding of the world, which could drive improvements in various machine learning applications, from natural language processing to computer vision.
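For readers who want to inspect the data themselves, Cosmopedia is published on the Hugging Face Hub and can be streamed with the `datasets` library. The snippet below assumes the public `HuggingFaceTB/cosmopedia` dataset id, its 'stanford' subset and the `prompt`/`text` fields listed on the dataset card; check the card for current configuration names before running it.

```python
# Stream a small sample of Cosmopedia from the Hugging Face Hub.
# Dataset id, subset and field names are taken from the public dataset
# card; verify them there, as configurations may change over time.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                  split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["prompt"][:80], "->", example["text"][:80])
    if i == 2:
        break
```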
What Industry Leaders Predict
Industry leaders like OpenAI CEO Sam Altman and Microsoft AI CEO Mustafa Suleyman are fairly vocal about the potential timeline for AGI. Altman has predicted that AGI could be achieved as soon as 2025, while Suleyman has suggested that recursive improvements driven by synthetic data could accelerate this timeline to three to five years.
Raising critical questions about recognising AGI, Fridman asked Askell, "How long would you need to be locked in a room with an AGI to know this thing is AGI?" "This is going to feel iterative... It might just be that there's this continuous ramping up," she responded. Her comment reflects the gradual, evolutionary nature of AGI development rather than a single, definitive breakthrough.
Fridman also suggested that an AGI might generate outputs so profound that they surpass human capabilities. "Maybe ask it to generate a poem, and the poem it generates, you're like, 'Yeah, okay. Whatever you did there, I don't think a human can do that'," he speculated.
These observations underscore the importance of novelty and creativity, qualities synthetic data aims to cultivate within AI systems.
These predictions hinge on the ability to continuously refine AI models through recursive cycles, with each version of the model contributing to the improvement of the next.
This vision of recursive improvements is grounded in the idea that by training models on synthetic data that mimics the thought process, AI systems will become increasingly capable of sophisticated reasoning. This would allow them to move beyond their current limitations and exhibit the kind of intelligence that can even rival human cognition.
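A highly simplified way to picture these recursive cycles is sketched below: each generation of a model produces synthetic examples, a filter keeps the best of them, and the next generation is trained on the curated set. The `generate_synthetic`, `quality_filter` and `train` functions are hypothetical placeholders for what are, in practice, far more complex and expensive steps.

```python
# Toy, hypothetical sketch of recursive improvement with synthetic data.
# Each round: the current model writes its own examples, a filter keeps
# the best, and the next generation is trained on the filtered set.

def recursive_improvement(model, rounds, generate_synthetic, quality_filter, train):
    for generation in range(rounds):
        candidates = generate_synthetic(model)        # model-produced examples
        curated = [ex for ex in candidates if quality_filter(ex)]
        model = train(model, curated)                 # next generation learns from them
        print(f"generation {generation + 1}: trained on {len(curated)} curated examples")
    return model
```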