OpenAI's o3 Models: A Leap Towards AGI, but Challenges Remain

35 Sources

[1]

Geeky Gadgets

New OpenAI o3 AI Model : A Giant Leap Toward Artificial General Intelligence (AGI)

OpenAI has introduced its latest AI models, o3 and o3 Mini, marking a significant step forward in the field of artificial intelligence. These models demonstrate exceptional capabilities in areas such as reasoning, coding, and mathematics, often achieving results that exceed human performance. Although not yet available to the public, both models are currently undergoing safety testing by researchers. This release is widely regarded as a critical milestone in the ongoing pursuit of Artificial General Intelligence (AGI), where machines are envisioned to perform a diverse range of tasks with human-like intelligence and adaptability. These models are not just incremental upgrades -- they represent a leap forward, tackling complex tasks in coding, mathematics, and reasoning with unprecedented accuracy. While the term AGI has long been a lofty goal, OpenAI o3's capabilities suggest we may be standing at the threshold of this fantastic milestone. But what does this mean for us, and how do we navigate the opportunities and challenges that come with it? If you've ever been curious -- or maybe even a little apprehensive -- about what advanced AI could mean for your life, you're not alone. The idea of machines excelling in areas once thought to be uniquely human can feel both exciting and overwhelming. But here's the thing: OpenAI isn't just focused on pushing boundaries; they're also prioritizing safety and accessibility. With o3 and its cost-effective counterpart, OpenAI o3 Mini, they're not only redefining what AI can achieve but also inviting researchers and developers to test, refine, and ensure these tools are used responsibly. o3 is OpenAI's latest "Frontier Model," designed to push the boundaries of AI performance and innovation. Building on the foundation laid by its predecessor, o1, o3 introduces significant advancements in key areas critical to AGI development. These include logical reasoning, advanced coding, and solving complex mathematical problems. 
By achieving innovative results in these domains, o3 positions itself as a leading contender in the race to develop systems capable of general intelligence. The model's design reflects OpenAI's commitment to creating AI systems that can tackle increasingly complex challenges. OpenAI o3's ability to handle intricate tasks with precision and efficiency underscores its potential to redefine how AI is applied across industries. This progress not only highlights the technical achievements of o3 but also reinforces its role as a stepping stone toward the broader vision of AGI. o3's capabilities are validated through its performance on a variety of benchmarks, where it consistently outperforms both its predecessor and human experts. These benchmarks provide a clear picture of the model's strengths and its potential applications. These results emphasize o3's ability to perform at or above human levels in tasks that hold significant intellectual and economic value. Its consistent performance across diverse benchmarks underscores its potential to transform industries reliant on advanced problem-solving and computational capabilities. o3's remarkable performance aligns closely with the core attributes of AGI, including the ability to solve novel problems and outperform humans in specialized domains. Its capacity for self-research and self-improvement suggests the possibility of an "intelligence explosion," where AI systems rapidly enhance their own capabilities. This potential raises both exciting opportunities and critical challenges. The advancements demonstrated by o3 open doors to fantastic applications in fields such as healthcare, education, and scientific research. However, they also underscore the importance of addressing safety and ethical considerations. 
Making sure that these powerful systems are developed and deployed responsibly is essential to mitigating potential risks and maximizing their benefits for society. In addition to o3, OpenAI has introduced o3 Mini, a streamlined version of the model designed to offer advanced AI capabilities at a lower cost. o3 Mini provides a more accessible option for developers and organizations seeking to use innovative AI without incurring significant expenses. Key features of o3 Mini include: By offering a balance between performance and affordability, o3 Mini expands the accessibility of advanced AI technologies, allowing a broader range of users to benefit from its capabilities. OpenAI has emphasized safety and responsibility in the release of o3 and o3 Mini. At present, these models are not publicly available but are open for safety testing by researchers. This cautious approach reflects OpenAI's commitment to thoroughly evaluating potential risks before a broader rollout. By involving the research community in this process, OpenAI aims to ensure that the models are deployed in a manner that aligns with ethical principles and societal values. The focus on safety extends to the development of robust safeguards and guidelines for the use of these advanced systems. OpenAI's proactive stance on ethical AI development highlights its dedication to addressing the challenges posed by increasingly powerful AI technologies. Looking to the future, OpenAI plans to collaborate with researchers and organizations to develop more challenging benchmarks that push the limits of AI capabilities. These efforts aim to refine the o3 models further, paving the way for broader applications and deployment. By fostering collaboration and prioritizing continuous improvement, OpenAI demonstrates its commitment to advancing AI responsibly and effectively. The introduction of o3 and o3 Mini represents a pivotal moment in the evolution of artificial intelligence. 
These models bring us closer to realizing the vision of AGI while addressing the critical challenges of safety and ethical deployment. As OpenAI continues to innovate and refine its technologies, the potential for AI to transform industries and improve lives becomes increasingly tangible.

[2]

Geeky Gadgets

Revolutionary AI Model o3 Sparks AGI Debate - Are We There Yet?

OpenAI has unveiled its latest artificial intelligence model, the "o3," which has achieved unprecedented results across a diverse range of complex tasks. This development has sparked renewed discussions about whether this milestone signifies the arrival of Artificial General Intelligence (AGI). While the o3 model demonstrates exceptional capabilities, experts remain divided on whether it meets the stringent criteria for true AGI. This announcement highlights both the remarkable progress in AI technology and the ongoing challenges in defining and measuring intelligence in a meaningful way. Imagine a world where machines not only assist us with routine tasks but also outperform us in areas we once considered uniquely human -- like solving complex math problems, coding intricate programs, or reasoning through scientific challenges. It's not science fiction anymore. OpenAI's latest breakthrough, the "o3" model, has sparked waves of excitement and debate, as it achieves results that rival and, in some cases, surpass human expertise. But as with any new innovation, its arrival raises as many questions as it answers: Are we witnessing the dawn of true Artificial General Intelligence (AGI), or is this just another impressive step on a much longer journey? The o3 model's achievements are undeniably remarkable, but they also come with a mix of awe and uncertainty. On one hand, its ability to adapt, reason, and generalize feels like a glimpse into the future of intelligence. On the other, experts are quick to point out its limitations and the challenges of defining what AGI truly means. Whether you're thrilled, skeptical, or simply curious about what this means for the future of AI and humanity, this overview guide by Wes Roth covers the significance of OpenAI's announcement, the hurdles that remain, and what it all might mean for the world we're building together. 
The o3 model has achieved record-breaking performance, surpassing human benchmarks in several specialized domains. Its key accomplishments include: These results place the o3 model ahead of human experts in these fields, underscoring its potential to excel in areas traditionally dominated by human intelligence. On the ARC AGI benchmark, a test designed to evaluate general intelligence, the model achieved a 75.7% score in low-compute mode (operating within a $10,000 compute budget) and an 87.5% score in high-compute mode. These outcomes suggest that the model can effectively solve a wide variety of tasks while adapting to different computational constraints, a critical step toward achieving generalization. One of the most notable features of the o3 model is its ability to reason through complex problems using a "Chain of Thought" approach. This method enables the model to break down tasks into intermediate steps, leading to more accurate and logical conclusions. This reasoning capability is particularly evident in its ability to adapt to novel tasks, demonstrating a level of generalization that goes beyond simply recalling training data. For instance, the o3 model successfully solved problems it had never encountered during training, inferring solutions based on underlying principles. This adaptability is a hallmark of AGI, as it indicates the potential to address a wide range of challenges without requiring task-specific programming. By using this reasoning ability, the o3 model showcases its capacity to tackle diverse and unfamiliar problems, a critical attribute for advancing toward true general intelligence. Despite its impressive capabilities, the o3 model's performance comes with significant computational demands. In high-compute mode, testing costs exceeded $300,000, highlighting the challenges of scaling such advanced systems. 
OpenAI has emphasized the importance of optimizing inference budgets and reducing the cost per task to make these innovative AI systems more accessible and sustainable. Efforts to improve compute efficiency are ongoing. Researchers are actively exploring methods to achieve comparable performance levels with fewer resources. This focus on cost optimization is essential for allowing broader adoption of AI technologies, particularly in applications where budget constraints are a limiting factor. By addressing these challenges, OpenAI aims to make advanced AI systems more practical for widespread use. While the o3 model's achievements are undeniably impressive, not all experts agree that it represents true AGI. François Chollet, the creator of the ARC AGI benchmark, has cautioned against prematurely labeling the model as AGI. He points out that the o3 model still struggles with tasks requiring deep understanding or creative problem-solving, areas where human intelligence continues to excel. Other researchers argue that existing benchmarks may not fully capture an AI system's ability to generalize or adapt to entirely new challenges. They advocate for the development of more comprehensive evaluation frameworks that can better assess an AI system's capabilities. This ongoing debate underscores the complexity of defining AGI and the importance of establishing rigorous, multidimensional metrics to evaluate intelligence. The rapid development of the o3 model, following closely on the heels of its predecessor, the o1 model, reflects the accelerating pace of AI innovation. In just three months, OpenAI has made significant advancements in reasoning, adaptability, and efficiency. This rapid progress raises important questions about the limits of AI capabilities and the timeline for achieving AGI. Researchers anticipate continued improvements in AI reasoning and scalability, driven by both algorithmic innovations and increased computational power. 
However, these advancements also bring significant challenges, including the need to address ethical considerations, manage resource demands, and refine evaluation methods. The o3 model's achievements highlight the dual nature of AI progress: the potential for fantastic applications and the necessity of addressing the broader implications of these technologies. The release of the o3 model has reignited debates about the definition of AGI. Some experts view its achievements as a significant milestone, while others argue that stricter criteria are necessary to classify a system as AGI. True AGI, they contend, would require the ability to solve all novel tasks without relying on brute-force computation or domain-specific training. This debate highlights the evolving nature of intelligence benchmarks and the inherent difficulty of defining a concept as complex as general intelligence. As AI systems continue to improve, the criteria for AGI may need to be revisited to account for new capabilities and emerging challenges. The o3 model serves as a reminder that while progress is being made, the journey toward AGI remains a nuanced and multifaceted endeavor. Looking ahead, OpenAI and the broader research community are focused on refining evaluation metrics, improving cost efficiency, and addressing scalability challenges. These efforts aim to ensure that AI systems can be effectively deployed across a wide range of applications, from advancing scientific research to solving everyday problems. The o3 model represents a pivotal moment in AI development, showcasing the potential for machines to perform at or above human levels in many areas. At the same time, it underscores the need for ongoing research to address the limitations and implications of these advancements. As the field evolves, so too will our understanding of intelligence and the role AI plays in shaping the future of society.

[3]

Geeky Gadgets

OpenAI's o3 and o3-Mini : Are We on the Brink of AGI?

OpenAI has introduced its latest AI models, the o3 and o3-Mini, which represent a significant advancement in artificial intelligence. These models exhibit remarkable capabilities in reasoning, coding, and solving mathematical problems, making them highly valuable for developers, researchers, and professionals in various fields. However, despite their impressive performance in specific domains, they do not yet meet the criteria to be classified as Artificial General Intelligence (AGI). This overview by WorldofAI explores their features, performance, and limitations to provide a comprehensive understanding of their potential and current constraints. Imagine a world where artificial intelligence not only assists with complex tasks but also adapts to your needs, evaluates its own performance, and learns from its mistakes -- all without missing a beat. But, as with any innovative technology, they come with their own set of quirks and limitations that might leave you wondering: how close are we really to AGI?

Key Features of the o3 and o3-Mini

The o3 and o3-Mini models are designed to handle a wide array of tasks with precision and adaptability. Their standout features include: These features make the o3 and o3-Mini versatile tools, suitable for a broad range of applications, including software development, data analysis, and scientific research.

Performance Benchmarks

The o3 and o3-Mini models demonstrate strong performance across several critical metrics, showcasing advancements over previous iterations of OpenAI's technology. Despite these achievements, the models occasionally struggle with simpler tasks, revealing inconsistencies that underscore the gap between their current capabilities and the broader goal of achieving AGI.

Efficiency and Limitations

While the o3 and o3-Mini models represent a leap forward in AI capabilities, they are not without limitations. These challenges emphasize the importance of continued research and development to enhance the models' efficiency, reliability, and overall performance.

Enhanced API Integration

The o3 and o3-Mini models are equipped with improved API functionalities, making them more practical and user-friendly for developers. Key enhancements include: These improvements make the o3 and o3-Mini models more accessible and practical for a wide range of applications, from automating repetitive tasks to supporting complex software development projects.

Future Prospects

The o3 and o3-Mini models represent a promising step forward in the journey toward AGI. Testing on the ARC AGI 2 benchmark has identified areas where further progress is needed, particularly in reasoning, efficiency, and adaptability. OpenAI's ongoing research is focused on addressing these challenges, with the ultimate goal of narrowing the gap between human expertise and machine intelligence. As these models continue to evolve, they are expected to play a pivotal role in advancing AI technology, paving the way for more sophisticated and capable systems. The advancements seen in the o3 and o3-Mini models highlight the potential of AI to transform industries and improve productivity. While they are not yet at the level of AGI, their innovative features -- such as adjustable reasoning modes, self-evaluation capabilities, and enhanced API functionalities -- position them as powerful tools for developers and researchers. With continued development, these models are likely to shape the future of AI, bringing us closer to the realization of AGI.

[4]

Geeky Gadgets

OpenAI Reveal They Achieved AGI - OpenAI o3

OpenAI has introduced its latest AI model, known as "o3," which has achieved a new milestone in artificial intelligence. Scoring an impressive 75.7% on the ARC (Abstraction and Reasoning Corpus) benchmark, the model has surpassed human performance in a test specifically designed to evaluate reasoning and adaptability. This achievement marks a significant stride toward Artificial General Intelligence (AGI) -- a state where machines can perform intellectual tasks on par with humans. Imagine a world where machines could think, reason, and adapt as seamlessly as humans do. It might sound like science fiction, but OpenAI's latest breakthrough, the OpenAI o3 model, brings us closer to that reality. This innovative AI has achieved a remarkable milestone, outperforming humans on the ARC benchmark -- a test specifically designed to measure intelligence through adaptability and problem-solving, not rote memorization. While this achievement is undeniably impressive, it also raises questions: Are we truly on the brink of Artificial General Intelligence (AGI), or is there still a long road ahead? This overview by AI Grid provides more insight into the benchmarks and latest announcements from OpenAI. But let's not get ahead of ourselves. The OpenAI o3 model's success is as much about its potential as it is about its limitations. Yes, it's a leap forward, but it's also a reminder of the challenges that remain -- like high computational costs and struggles with tasks humans find simple. Still, this milestone is a testament to how far AI has come and where it might take us next. Whether you're excited, skeptical, or just curious, this article will unpack what this achievement means, how it works, and why it matters for the future of AI. Let's dive in. The ARC benchmark serves as a vital tool for assessing machine intelligence. 
Unlike traditional benchmarks that often focus on testing memorization or pattern recognition, ARC evaluates an AI system's ability to solve novel problems using core reasoning and adaptability. These tasks, which include elements like basic physics, pattern recognition, and counting, are intuitive for humans but notoriously challenging for AI systems. The OpenAI o3 model's score of 75.7% on this benchmark represents a significant leap forward in AI performance. This achievement underscores the model's ability to generalize knowledge and solve problems without relying on rote learning. Such capabilities are essential for advancing AI systems toward more human-like intelligence. By excelling in ARC, the OpenAI o3 model demonstrates its potential to tackle complex, real-world problems that require reasoning and adaptability. The OpenAI o3 model is available in two distinct variants, each tailored to meet specific needs and applications. This dual-variant approach enhances the model's flexibility and ensures it can address a wide range of challenges effectively. These two variants highlight the model's adaptability, allowing users to balance performance and cost considerations based on their specific requirements. The OpenAI o3 model's performance on the ARC benchmark represents a significant breakthrough in AI's ability to adapt to new and unfamiliar tasks. This milestone brings the field closer to AGI, where machines could theoretically perform any intellectual task a human can. However, the model still falls short of fully meeting AGI criteria. It struggles with certain tasks that are straightforward for humans and faces limitations in computational efficiency, which remain critical hurdles. Despite these challenges, the OpenAI o3 model's success demonstrates the feasibility of creating benchmarks that challenge AI systems in ways that align with human intuition. 
This progress paves the way for further advancements in AI, particularly in developing systems capable of reasoning and problem-solving at a level comparable to human intelligence. While the OpenAI o3 model showcases impressive capabilities, it is not without its limitations. These challenges highlight areas where further innovation and development are needed: These limitations underscore the importance of addressing efficiency and scalability to ensure that advanced AI systems can be deployed more widely and effectively. The advancements of the OpenAI o3 model extend beyond its performance on the ARC benchmark. It has also demonstrated significant improvements in other domains, such as software engineering and advanced mathematics. For example, the model has achieved a 20-fold improvement in solving novel, research-level math problems compared to its predecessors. These achievements highlight the model's versatility and its potential to address complex challenges across various fields. In addition to its technical capabilities, the OpenAI o3 model's progress raises broader questions about how AGI should be defined and measured. As AI systems continue to improve in reasoning, adaptability, and efficiency, the boundaries of what machines can achieve are being redefined. This ongoing evolution will likely shape the future of AI research and its applications in diverse industries. The release of OpenAI's o3 model marks a pivotal moment in the development of artificial intelligence. Its achievements on the ARC benchmark and other tests demonstrate the rapid pace of innovation in the field. However, these advancements also bring challenges, such as high operational costs and the need for more efficient systems. OpenAI plans to make the OpenAI o3 model more widely available, potentially unlocking new applications and opportunities across various sectors. 
As the field of AI continues to evolve, experts anticipate further breakthroughs that could reshape the boundaries of what machines can achieve. Over time, the costs associated with running advanced AI models are expected to decline, following trends observed in other technological advancements. This could make powerful AI systems like the OpenAI o3 model more accessible, allowing their use in a broader range of applications. The progress achieved by the OpenAI o3 model serves as a testament to the potential of artificial intelligence. While challenges remain, the advancements made so far provide a strong foundation for future innovation, bringing the field closer to realizing the vision of AGI and its fantastic impact on society.
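
Benchmark gains like the ones above are quoted either as percentage points or as a fold improvement, and the two can sound very different for the same result. The short sketch below shows the arithmetic. The input values (a 2% baseline and a 25% score) are the research-level math figures reported in other coverage of the announcement and are used here only as illustrative inputs; the function name `describe_gain` is hypothetical.

```python
def describe_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, fold improvement) for two accuracy scores."""
    if before_pct <= 0:
        # A fold improvement has no meaning when the baseline is zero or negative.
        raise ValueError("fold improvement is undefined for a non-positive baseline")
    return after_pct - before_pct, after_pct / before_pct


if __name__ == "__main__":
    # Illustrative inputs only: a 2% baseline and a 25% score.
    points, fold = describe_gain(before_pct=2.0, after_pct=25.0)
    print(f"{points:.1f} percentage points, {fold:.1f}x the baseline")
```

Because a fold figure divides by the baseline, it grows quickly when the baseline is small, so it is worth checking which baseline a reported multiple uses.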

[5]

AIM

OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI

As OpenAI's '12 days of shipmas' comes to a close, the company soft-announced AGI through the introduction of the next-generation frontier models o3 and o3 Mini. These models achieve state-of-the-art performance, nearing 90%, on the ARC-AGI benchmark, surpassing human performance. Much has changed in a span of one month. In November, Sam Altman hinted that they might have achieved this benchmark internally. However, Francois Chollet, the creator of the ARC-AGI benchmark, dismissed this claim as premature. Yesterday, with the 'o' family of models virtually saturating the benchmark, the ARC team announced a newer, upgraded evaluation (ARC-AGI-2). Although not yet publicly available, these frontier models will now be accessible to researchers for public safety testing. o3 Mini is slated for release in January 2025, with o3 to follow shortly after. "We view this as sort of the beginning of the next phase of AI," said Altman on the livestream. But Chollet opines that OpenAI is still not there with AGI. "While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI - there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3," Chollet posted on X. While OpenAI was widely expected to announce AGI during the '12 days of shipmas', Altman has trodden cautiously with a soft announcement, since a formal declaration would trigger a clause in the contract with its lead investor, Microsoft, under which Microsoft would lose access to OpenAI's technology. Also, announcing AGI would invite more scrutiny and provoke competitors like Google and Anthropic. Companies are actively going to scale reasoning capabilities in the coming year. Google recently released Gemini 2.0 Flash Thinking with advanced reasoning capabilities. This joins Chinese models Qwen and DeepSeek. 
Besides, Meta has also hinted at releasing reasoning models next year, with xAI's Grok and Anthropic expected to follow. OpenAI researchers are betting heavily on reinforcement learning (RL) to further this new paradigm of reasoning. "o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on the chain of thought to scale inference compute. Way faster than pretraining the paradigm of a new model every 1-2 years," OpenAI's Jason Wei said on X. Interestingly, the RL technique aligns closely with Google DeepMind's expertise. "While o3 is very impressive, I feel like the test time inference/RL models play perfectly into Google's strength," said Finbarr Timbers, former researcher at Google DeepMind. OpenAI skipped the name "o2" to avoid trademark concerns with an existing telecom company of the same name. On ARC-AGI, scores rose from 0% with GPT-2 to 87.5% with o3 in a span of five years. o3 scored 75.7% on the ARC-AGI semi-private set under standard compute conditions. With high-compute settings, it reached 87.5%, surpassing the 85% human-level performance threshold. The ARC team noted that o3 is the costliest model at test-time but marks a new era where greater compute unlocks extraordinary performance. "My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale," shared Nat McAleese from OpenAI's research team. The o3 model also excels in software engineering benchmarks, achieving 71.7% accuracy on SWE-bench Verified, an improvement of more than 20 percentage points over its predecessor, o1. This benchmark focuses on real-world coding tasks. With this new milestone, human software engineering is a thing of the past. 
On Epoch AI's FrontierMath benchmark, regarded as the toughest mathematical test available, o3 achieved an impressive 25% accuracy, a huge leap from the previous state of the art of 2%. This benchmark includes novel, unpublished problems that challenge professional mathematicians. OpenAI's o3 has a Codeforces rating of 2727, equivalent to roughly the 175th-best human competitive programmer worldwide. "This is an absolutely superhuman result for AI and technology at large," shared VC analyst Deedy Das on X. In addition to these benchmarks, the team showed that o3 Mini supports API features like function calling, structured outputs, and developer messages. A demo on the livestream showed o3 Mini creating a ChatGPT-like UI to evaluate itself on GPQA, generating a Python script, processing inputs, and grading its performance. Altman stressed that as their models get more and more capable, safety testing will be taken even more seriously. To this end, OpenAI is also opening public safety testing for researchers. OpenAI also introduced the concept of deliberative alignment, a new safety technique that uses o3's advanced reasoning capabilities to identify and reject unsafe prompts more effectively. Anthropic, too, released research on this. "AI models will get extremely good at deceiving humans if we teach them to lie," said the newly appointed AI Czar David Sacks on the need for trust and safety. Incubators like Y Combinator are also increasingly funding startups that build for a post-AGI world. These include government software, public safety, US manufacturing with AI and robotics, LLM chip design, space tech, human-centric jobs, and energy-efficient computing, among others. YC chief Garry Tan urged that in this new reality, actual dedication to craft will take center stage. "Actually make something people want. Software and coding won't be the gating factor," he said. 
On the whole, systemic changes such as Universal Basic Income (UBI) and Universal Basic Compute (UBC) will be the foundation for this new reality - where GDP will grow because of AI, and not extra work hours. With the ongoing progress in robotics, Universal Basic Robot (UBR) is also beginning to become a huge theme for 2025.
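
The livestream demo mentioned above had o3 Mini write a script that grades its own answers on GPQA, a multiple-choice benchmark. The report does not show that script, so what follows is only a minimal sketch of what such a grading loop might look like. Everything in it is an assumption made for illustration: the `Question` type, the "Answer: X" response format, and the `fake_model` stand-in, which replaces a real model call so the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Assumed response format: the model ends its reply with "Answer: <letter>".
ANSWER_PATTERN = re.compile(r"answer\s*:\s*([A-D])\b", re.IGNORECASE)


@dataclass(frozen=True)
class Question:
    """Hypothetical multiple-choice item with a known correct letter."""
    prompt: str
    correct: str  # one of "A", "B", "C", "D"


def extract_choice(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter found in a response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def grade(questions: Sequence[Question], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions if extract_choice(ask_model(q.prompt)) == q.correct
    )
    return correct / len(questions)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers B."""
    return "Some reasoning about the question. Answer: B"


if __name__ == "__main__":
    sample = [Question("Question 1 ...", "B"), Question("Question 2 ...", "C")]
    print(f"accuracy: {grade(sample, fake_model):.2f}")
```

In a real setup, `fake_model` would be replaced by a call to whichever model API is in use, and unparseable responses (where `extract_choice` returns `None`) simply count as incorrect.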

[6]

AIM

OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI

As OpenAI's '12 days of shipmas' comes to a close, the company soft-announced AGI through the introduction of the next-generation frontier models o3 and o3 Mini. These models achieve state-of-the-art performance, nearing 90%, on the ARC-AGI benchmark, surpassing human performance. Much has changed in a span of one month. In November, Sam Altman hinted that they might have achieved this benchmark internally. However, Francois Chollet, the creator of the ARC-AGI benchmark, dismissed this claim as premature. Yesterday, the 'o' family of models virtually saturated the benchmark, and the ARC team announced a newer, upgraded evaluation. Although not yet publicly available, these frontier models will now be accessible to researchers for public safety testing. o3 Mini is slated for release in January 2025, with o3 to follow shortly after. "We view this as sort of the beginning of the next phase of AI," said Altman on the livestream. But Chollet opines that OpenAI is still not there with AGI. "While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI - there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3," Chollet posted on X. While OpenAI was widely expected to announce AGI during the '12 days of shipmas', Altman has trodden cautiously with a soft announcement, since a formal declaration would trigger a clause in the contract with its lead investor, Microsoft, under which Microsoft would lose access to OpenAI's technology. Also, announcing AGI would invite more scrutiny and provoke competitors like Google and Anthropic. Companies are actively going to scale reasoning capabilities in the coming year. Google recently released Gemini 2.0 Flash Thinking with advanced reasoning capabilities, which also displays its reasoning. This joins Chinese models Qwen and DeepSeek. 
Besides, Meta has hinted at releasing reasoning models next year, with xAI's Grok and Anthropic expected to follow. OpenAI researchers are heavily betting on reinforcement learning (RL) to further this new paradigm of reasoning, aligning with OpenAI co-founder Ilya Sutskever's claim that the era of pretraining has officially ended. "o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on the chain of thought to scale inference compute. Way faster than pretraining the paradigm of a new model every 1-2 years," OpenAI's Jason Wei said on X. Interestingly, the RL technique aligns closely with Google DeepMind's expertise. "While o3 is very impressive, I feel like the test time inference/RL models play perfectly into Google's strength," said Finbarr Timbers, former researcher at Google DeepMind. OpenAI skipped the name "o2" to avoid trademark concerns with an existing telecom company of the same name. On ARC-AGI, scores rose from 0% with GPT-2 to 87.5% with o3 in a span of five years. o3 scored 75.7% on the ARC-AGI semi-private set under standard compute conditions. With high-compute settings, it reached 87.5%, surpassing the 85% human-level performance threshold. The ARC team noted that o3 is the costliest model at test-time but marks a new era where greater compute unlocks extraordinary performance. "My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale," shared Nat McAleese from OpenAI's research team. The o3 model also excelled in software engineering benchmarks, achieving 71.7% accuracy on SWE-bench Verified, an improvement of more than 20 percentage points over its predecessor, o1. This benchmark focuses on real-world coding tasks. With this new milestone, human software engineering is a thing of the past. 
On the Epoch AI Frontier Math Benchmark, regarded as the toughest mathematical test available, o3 achieved an impressive 25% accuracy, a huge leap from the SOTA 2%. This benchmark includes novel, unpublished problems that challenge professional mathematicians. OpenAI's o3 holds a rating of 2727 on Codeforces, equal to the 175th best human coder worldwide. "This is an absolutely superhuman result for AI and technology at large," shared VC analyst Deedy Das on X. In addition to these benchmarks, the team showed that o3 Mini supports API features like function calling, structured outputs, and developer messages. A demo on the livestream showed o3 Mini creating a ChatGPT-like UI to evaluate itself on GPQA, generating a Python script, processing inputs, and grading its performance. Altman stressed that as their models get more and more capable, safety testing will be taken even more seriously. To this end, OpenAI is also opening public safety testing for researchers. OpenAI also introduced the concept of deliberative alignment, a new safety technique that uses o3's advanced reasoning capabilities to identify and reject unsafe prompts more effectively. This approach has demonstrated significant improvements in both rejection accuracy and over-refusal rates, enabling the model to detect subtle user intents designed to bypass safety mechanisms. Anthropic, too, released research on this. "AI models will get extremely good at deceiving humans if we teach them to lie," said the newly appointed AI Czar David Sacks on the need for trust and safety. Incubators like Y Combinator are also increasingly funding startups that solve for a post-AGI world. These include government software, public safety, US manufacturing with AI and robotics, LLM chip design, space tech, human-centric jobs, and energy-efficient computing, among others. YC chief Garry Tan urged that in this new reality, actual dedication to craft will take center stage. "Actually make something people want. 
Software and coding won't be the gating factor," he said. On the whole, systemic changes such as Universal Basic Income (UBI) and Universal Basic Compute (UBC) will be the foundation for this new reality, where GDP will grow because of AI and not because of extra work hours. With the ongoing progress in robotics, Universal Basic Robot (UBR) is also emerging as a major theme for 2025.

[7]

Forrester

OpenAI's o3: Hype Or A Real Step Toward AGI?

Just in time for Christmas, OpenAI is generating buzz with its o3 and o3-mini models, claiming groundbreaking reasoning capabilities. Headlines like 'OpenAI O3: AGI is Finally Here' are starting to show up. But what are these 'reasoning advancements,' and how close are we really to Artificial General Intelligence (AGI)? Let's explore the benchmarks, current shortcomings, and broader implications.

o3's Benchmarks Show Progress In Reasoning And Adaptability

OpenAI's o3 builds on its predecessor, o1, with enhanced reasoning and adaptability. I blogged about o1 in September. The o3 models show notable performance improvements.

Reasoning Holds The Key To More Autonomous Agents - And To AI Progress

Reasoning models like o3 and Google's Gemini 2.0 represent significant advancements in structured problem-solving. Techniques like "chain-of-thought prompting" help these models break down complex tasks into manageable steps, enabling them to excel in areas like coding, scientific analysis, and decision-making. Today's reasoning models have many limitations. Gary Marcus openly criticizes OpenAI for what amounts to cheating in how they pretrained o3 on the ARC-AGI benchmark. Even OpenAI admits o3's reasoning limitations, acknowledging that the model fails on some "easy" tasks and that AGI remains a distant goal. These criticisms underscore the need to temper expectations and focus instead on the incremental nature of AI progress. Google's Gemini 2.0, on the other hand, differentiates itself from OpenAI through multimodal reasoning -- integrating text, images, and other data types -- to handle diverse tasks, such as medical diagnostics. This capability highlights the growing versatility of reasoning models. However, reasoning models only address one set of skills needed to approximate human-equivalent abilities in agents. Today's best models lack critical: Moreover, while research into model reasoning has produced techniques that are well-suited for today's transformer-based models, the three skills mentioned above are expected to pose significantly greater challenges. Tracking and discerning the truth in announcements like this, coupled with learning how to better work with more capable machine intelligences, are important steps for enterprises. Enterprise capabilities like platforms, governance and security are as important because foundation model vendors will continue to leapfrog each other in reasoning capabilities. The Forrester Wave™: AI Foundation Models For Language, Q2 2024 points out that benchmarks are just one chapter in the story and models need enterprise capabilities to be useful.

AGI Is A Journey, Not a Destination - And We're Only At The Beginning

AGI is often portrayed as a sudden breakthrough, as we have seen depicted in the movies, or as an intelligence explosion, as philosopher Nick Bostrom imagines in his book, Superintelligence. In reality, it will be an evolutionary process. Announcements like this mark milestones, but they are just the beginning. Ultimately, as agents become more autonomous, the resulting AGI will not replace human intelligence but rather will enhance it. Unlike human intelligence, AGI will be machine intelligence designed to complement human strengths and address complex challenges. As organizations navigate this transformative technology, success will depend on aligning AGI capabilities with human-centric goals to foster exploration and growth responsibly. The rise of advanced reasoning models in this journey presents both opportunities and challenges for responsible development and deployment. These systems will amplify your firm's automation and engagement capabilities, but they demand increasingly rigorous safeguards to mitigate ethical and operational risks.

[8]

Tom's Guide

OpenAI unveils o3 and o3 mini -- here's why these 'reasoning' models are a giant leap

OpenAI has introduced its latest AI models, o3 and o3-mini, signaling a new chapter for the tech giant. During today's final session of OpenAI's "12 Days of OpenAI" event, CEO Sam Altman highlighted the groundbreaking potential of the latest models, which are designed to enhance reasoning, coding proficiency and problem-solving. The o3 and o3-mini models are currently being made available to researchers for rigorous safety testing. OpenAI is committed to ensuring the reliability and ethical deployment of its models before they reach a broader audience. The timeline for public release remains undisclosed, but the company emphasizes its focus on aligning advanced AI systems with human values and societal benefits. Building on the success of the o1 model launched in September 2024, o3 focuses on deliberate problem-solving and thoughtful responses. Unlike previous iterations, the o3 models employ extended internal deliberation before producing answers. This approach allows them to tackle complex tasks that require advanced reasoning, such as intricate coding challenges and mathematical computations. Early benchmarks reveal that o3 significantly outperforms its predecessors, showcasing superior accuracy and adaptability in diverse scenarios. A standout feature of the o3 models is their coding proficiency. Altman noted that o3 has demonstrated exceptional abilities in programming tasks, making it a valuable tool for developers. By integrating deeper reasoning capabilities, the models not only generate accurate code but also provide insightful explanations, helping users understand and refine their projects. This announcement comes on the heels of heightened competition in the AI sector. Just a day before OpenAI's event, Google unveiled its reasoning model, Gemini 2.0. Described by CEO Sundar Pichai as Google's "most thoughtful model yet," Gemini 2.0 is designed to excel in similar areas, underscoring the escalating rivalry between tech giants. 
OpenAI's o3 models, however, aim to set themselves apart with their deliberate reasoning methodology and coding excellence. The introduction of the o3 models highlights the untapped possibilities of AI reasoning capabilities. From enhancing software development workflows to solving complex scientific problems, o3 has the potential to reshape industries and redefine human-AI collaboration. For now, the tech community awaits the results of OpenAI's safety testing, eager to see how these models will impact the future of artificial intelligence.

[9]

TechCrunch

OpenAI announces new o3 models

OpenAI saved its biggest announcement for the last day of its 12-day "shipmas" event. On Friday, the company unveiled o3, the successor to the o1 "reasoning" model it released earlier in the year. o3 is a model family, to be more precise -- as was the case with o1. There's o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks. Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. Strange world we live in, isn't it? Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview starting later today. The o3 family may not be generally available for some time -- at least, if OpenAI CEO Sam Altman sticks to his word. In a recent interview, Altman said that, before OpenAI releases new reasoning models, he'd prefer a federal testing framework to guide monitoring and mitigating the risks of such models. And there are risks. AI safety testers have found that o1's reasoning abilities make it try to deceive human users at a higher rate than conventional, "non-reasoning" models -- or, for that matter, leading AI models from Meta, Anthropic, and Google. It's possible that o3 attempts to deceive at an even higher rate than its predecessor; we'll find out once OpenAI's red-teaming partners release their test results.

Reasoning steps

Unlike most AI, reasoning models such as o3 effectively fact-check themselves, which helps them to avoid some of the pitfalls that normally trip up models. This fact-checking process incurs some latency. o3, like o1 before it, takes a little longer -- usually seconds to minutes longer -- to arrive at solutions compared to a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science, and mathematics. o3 was trained to "think" before responding via what OpenAI calls a "private chain of thought." 
The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution. In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and "explaining" its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response. One big question leading up to today was, might OpenAI claim that its newest models are approaching AGI? AGI, short for "artificial general intelligence," refers broadly speaking to AI that can perform any task a human can. OpenAI has its own definition: "highly autonomous systems that outperform humans at most economically valuable work." Achieving AGI would be a bold claim. And it carries contractual weight for OpenAI, as well. According to the terms of its deal with close partner and investor Microsoft, once OpenAI achieves AGI, it's no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI's AGI definition, that is). Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o1 attained a score of 25% to 32% (100% being the best). Eighty-five percent is considered "human-level," but one of the creators of ARC-AGI, Francois Chollet, called the progress "solid." OpenAI says that o3, at its best, achieved an 87.5% score. Incidentally, OpenAI says it'll partner with the foundation behind ARC-AGI to build the next generation of its benchmark. In the wake of the release of OpenAI's first series of reasoning models, there's been an explosion of reasoning models from rival AI companies -- including Google. In early November, DeepSeek, an AI research company funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. 
That same month, Alibaba's Qwen team unveiled what it claimed was the first "open" challenger to o1. What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As my colleague Max Zeff recently reported, "brute force" techniques to scale up models are no longer yielding the improvements they once did. Not everyone's convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they've performed well on benchmarks so far, it's not clear whether reasoning models can maintain this rate of progress. Interestingly, the release of o3 comes as one of OpenAI's most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI's "GPT series" of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he's leaving to pursue independent research.

[10]

Decrypt

OpenAI's o3 Hits Human-Level Scores, But Is It Good Enough to Be AGI?

OpenAI's latest AI model family has achieved what many thought impossible, scoring an unprecedented 87.5% on the challenging Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark -- basically near the minimum threshold for what could theoretically be considered "human." The ARC-AGI benchmark tests how close a model is to achieving artificial general intelligence, meaning whether it can think, solve problems, and adapt like a human in different situations... even when it hasn't been trained for them. The benchmark is extremely easy for humans to beat, but is extremely hard for machines to understand and solve. The San Francisco-based AI research company unveiled o3 and o3-mini last week as part of its "12 days of OpenAI" campaign -- and just days after Google announced its own o1 competitor. The release showed that OpenAI's upcoming model was closer to reaching artificial general intelligence than expected. OpenAI's new reasoning-focused model marks a fundamental shift in how AI systems approach complex reasoning. Unlike traditional large language models that rely on pattern matching, o3 introduces a novel "program synthesis" approach that allows it to tackle entirely new problems it hasn't encountered before. "This is not merely incremental improvement, but a genuine breakthrough," the ARC team stated in their evaluation report. In a blog post, ARC Prize co-founder Francois Chollet went even further, suggesting that "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." Just for reference, here is what ARC Prize says about its scores: "The average human performance in the study was between 73.3% and 77.2% correct (public training set average: 76.2%; public evaluation set average: 64.2%.)" OpenAI o3 achieved an 87.5% score in its high-compute configuration. That score was leaps ahead of any other AI model currently available. 
Despite its impressive results, the ARC Prize board -- and other experts -- said that AGI has not yet been achieved, so the $1 million prize remains unclaimed. But experts across the AI industry were not unanimous in their opinions about whether o3 had breached the AGI benchmark. Some -- including Chollet himself -- took issue with whether the benchmarking test itself was even the best gauge of whether a model was approaching real, human-level problem-solving: "Passing ARC-AGI does not equate to achieving AGI, and as a matter of fact, I don't think o3 is AGI yet," Chollet said. "O3 still fails on some very easy tasks, indicating fundamental differences with human intelligence." He referenced a newer version of the AGI benchmark, which he said would provide a more accurate measure of how close an AI is to being able to reason like a human. Chollet noted that "early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)." Other skeptics even claimed that OpenAI effectively gamed the test. "Models like o3 use planning tricks. They outline steps ("scratchpads") to improve accuracy, but they're still advanced text predictors. For example, when o3 'counts letters,' it's generating text about counting, not truly reasoning," Zeroqode co-founder Levon Terteryan wrote on X. A similar point of view is shared by other AI scientists, like the award-winning AI researcher Melanie Mitchell, who argued that o3 isn't truly reasoning but performing a "heuristic search." Chollet and others pointed out that OpenAI wasn't transparent about how its models operate. The models appear to be trained on different Chain of Thought processes "in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search," said Mitchell. 
In other words, it doesn't know how to solve a new problem, and instead applies the most likely Chain of Thought drawn from its vast corpus of knowledge until it successfully finds a solution. Put differently, o3 isn't truly creative -- it simply relies on a vast library to trial-and-error its way to a solution. "Brute force (does not equal) intelligence. o3 relied on extreme computing power to reach its unofficial score," Jeff Joyce, host of the Humanity Unchained AI podcast, argued on LinkedIn. "True AGI would need to solve problems efficiently. Even with unlimited resources, o3 couldn't crack over 100 puzzles that humans find easy." OpenAI researcher Vahid Kazemi is in the "This is AGI" camp. "In my opinion we have already achieved AGI," he said, pointing to the earlier o1 model, which he argued was the first designed to reason instead of just predicting the next token. He drew a parallel to scientific methodology, contending that since science itself relies on systematic, repeatable steps to validate hypotheses, it's inconsistent to dismiss AI models as non-AGI simply because they follow a set of predetermined instructions. That said, OpenAI has "not achieved 'better than any human at any task,' " he wrote. For his part, OpenAI CEO Sam Altman isn't taking a position on whether AGI has been reached. He simply said that "o3 is a very very smart model," and "o3 mini is an incredibly smart model but with really good performance and cost." Being smart may not be enough to claim that AGI has been achieved -- at least yet. But stay tuned: "We view this as sort of the beginning of the next phase of AI," he added.

[11]

Digital Trends

OpenAI teases its 'breakthrough' next-generation o3 reasoning model

For the finale of its 12 Days of OpenAI livestream event, OpenAI CEO Sam Altman revealed the company's next foundation model, the successor to the recently announced o1 family of reasoning AIs, dubbed o3 and o3-mini. And no, you aren't going crazy -- OpenAI skipped right over o2, apparently to avoid infringing on the trademark of British telecom provider O2. While the new o3 models are not being released to the public just yet and there's no word on when they'll be incorporated into ChatGPT, they are now available for testing by safety and security researchers.

"o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK" (Greg Brockman, @gdb, December 20, 2024)

The o3 family, like the o1 family before it, operates differently than traditional generative models in that it internally fact-checks its responses prior to presenting them to the user. While this technique slows the model's response time anywhere from a few seconds to a few minutes, its answers to complex science, math, and coding queries tend to be more accurate and reliable than what you'd get from GPT-4. Additionally, the model is able to transparently explain the reasoning by which it arrived at its result. Users can also manually adjust the amount of time the model spends considering a problem by selecting between low, medium, and high compute, with the highest setting returning the most complete answers. That performance does not come cheap, mind you. The processing at high compute reportedly will cost thousands of dollars per task, ARC-AGI co-creator Francois Chollet wrote in an X post Friday.

"Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task..." pic.twitter.com/ESQ9CNVCEA (François Chollet, @fchollet, December 20, 2024)

The new family of reasoning models reportedly offers significantly improved performance over even o1, which debuted in September, on the industry's most challenging benchmark tests. According to the company, o3 outperforms its predecessor by nearly 23 percentage points on the SWE-Bench Verified coding test and scores more than 60 points higher than o1 on the Codeforces benchmark. The new model also scored an impressive 96.7% on the AIME 2024 mathematics test, missing just one question, and outperformed human experts on the GPQA Diamond, notching a score of 87.7%. Even more impressive, o3 reportedly solved more than a quarter of the problems presented on the EpochAI Frontier Math benchmark, where other models have struggled to correctly solve more than 2% of them. OpenAI does note that the models it previewed on Friday are still early versions and that "final results may evolve with more post-training." The company has additionally incorporated new "deliberative alignment" safety measures into o3's training methodology. The o1 reasoning model has shown a troubling habit of trying to deceive human evaluators at a higher rate than conventional AIs like GPT-4o, Gemini, or Claude; OpenAI believes that the new guardrails will help minimize those tendencies in o3. Members of the research community interested in trying o3-mini for themselves can sign up for access on OpenAI's waitlist.

[12]

VentureBeat

OpenAI's o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning

OpenAI's latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the super-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%. While the achievement in ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.

Abstract Reasoning Corpus

The ARC-AGI benchmark is based on the Abstract Reasoning Corpus, which tests an AI system's ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is composed of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging measures of AI. ARC has been designed in a way that it can't be cheated by training models on millions of examples in hopes of covering all possible combinations of puzzles. The benchmark is composed of a public training set that contains 400 simple examples. The training set is complemented by a public evaluation set that contains 400 puzzles that are more challenging as a means to evaluate the generalizability of AI systems. The ARC-AGI Challenge contains private and semi-private test sets of 100 puzzles each, which are not shared with the public. They are used to evaluate candidate AI systems without running the risk of leaking the data to the public and contaminating future systems with prior knowledge. Furthermore, the competition sets limits on the amount of computation participants can use to ensure that the puzzles are not solved through brute-force methods.

A breakthrough in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. 
Another method developed by researcher Jeremy Berman used a hybrid approach, combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3. In a blog post, François Chollet, the creator of ARC, described o3's performance as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models." It is important to note that using more compute on previous generations of models could not reach these results. For context, it took 4 years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don't know much about o3's architecture, we can be confident that it is not orders of magnitude larger than its predecessors. "This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs," Chollet wrote. "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." It is worth noting that o3's performance on ARC-AGI comes at a steep cost. On the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as the costs of inference continue to decrease, we can expect these figures to become more reasonable.

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists refer to as "program synthesis." A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. 
But they lack compositionality, which prevents them from figuring out puzzles that are beyond their training distribution. Unfortunately, there is very little information about how o3 works under the hood, and here, the opinions of scientists diverge. Chollet speculates that o3 uses a type of program synthesis that uses chain-of-thought (CoT) reasoning and a search mechanism combined with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open source reasoning models have been exploring in the past few months. Other scientists such as Nathan Lambert from the Allen Institute for AI suggest that "o1 and o3 can actually be just the forward passes from one language model." On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was "just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1." On the same day, Denny Zhou from Google DeepMind's reasoning team called the combination of search and current reinforcement learning approaches a "dead end." "The most beautiful thing on LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt," he posted on X. While the details of how o3 reasons might seem trivial in comparison to the breakthrough on ARC-AGI, it can very well define the next paradigm shift in training LLMs. There is currently a debate on whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or different inference architectures can determine the next path forward.

Not AGI

The name ARC-AGI is misleading and some have equated it to solving AGI. However, Chollet stresses that "ARC-AGI is not an acid test for AGI." 
"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," he writes. "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence." Moreover, he notes that o3 cannot autonomously learn these skills and it relies on external verifiers during inference and human-labeled reasoning chains during training. Other scientists have pointed to the flaws of OpenAI's reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. "The solver should not need much specific 'training', either on the domain itself or on each specific task," writes scientist Melanie Mitchell. To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes "seeing if these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC." Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training. "You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible," Chollet writes.
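To put the cost figures quoted earlier in this article in perspective, the following is a rough back-of-the-envelope sketch. It takes the reported $17 to $20 per puzzle in the low-compute configuration, the roughly 172X compute multiplier for the high-compute budget, and the 100-puzzle semi-private test set. The assumption that dollar cost scales linearly with compute is ours, not a reported fact, and the function and constant names are hypothetical.

```python
# Figures reported in the article; the linear cost model is an assumption.
LOW_COST_PER_TASK_USD = (17.0, 20.0)  # reported low-compute cost range per puzzle
HIGH_COMPUTE_MULTIPLIER = 172         # reported high-compute budget vs. low compute
SEMI_PRIVATE_TASKS = 100              # puzzles in the semi-private test set


def estimate_costs(cost_range, multiplier, num_tasks):
    """Return (per-task range, total range) in USD.

    Assumes dollar cost grows in direct proportion to compute,
    which is an illustrative simplification.
    """
    per_task = tuple(cost * multiplier for cost in cost_range)
    total = tuple(cost * num_tasks for cost in per_task)
    return per_task, total


low_per_task, low_total = estimate_costs(
    LOW_COST_PER_TASK_USD, 1, SEMI_PRIVATE_TASKS
)
high_per_task, high_total = estimate_costs(
    LOW_COST_PER_TASK_USD, HIGH_COMPUTE_MULTIPLIER, SEMI_PRIVATE_TASKS
)

print(f"Low compute:  ${low_per_task[0]:,.0f}-${low_per_task[1]:,.0f} per task, "
      f"${low_total[0]:,.0f}-${low_total[1]:,.0f} for the set")
print(f"High compute: ${high_per_task[0]:,.0f}-${high_per_task[1]:,.0f} per task, "
      f"${high_total[0]:,.0f}-${high_total[1]:,.0f} for the set")
```

Under that linear assumption, the high-compute run would land in the low thousands of dollars per puzzle, which is in line with other reports of "thousands of dollars per task." Actual pricing was not disclosed for the high-compute configuration, so this is an estimate only.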

[13]

Beebom

OpenAI Unveils o3 Model and Becomes First to Crack the ARC-AGI Benchmark in 5 Years

OpenAI also announced o3-mini, which is optimized for coding and offers faster performance at a lower cost. On the final day of the "12 days of OpenAI" announcements, OpenAI revealed the biggest update. OpenAI announced the o3 and o3-mini reasoning models, and most notably, OpenAI made history as o3 became the first AI model to crack the hallowed ARC-AGI benchmark, breaking a five-year unbeaten streak. On the ARC-AGI Semi-Private Evaluation Set, OpenAI's o3 model scored a whopping 87.5% when using high-compute resources and given more time to think. The ARC Prize threshold was set at 85%, close to what humans generally achieve. Just so you know, the OpenAI o1 model could only score 32%. ARC-AGI is designed to test AI models for generalized intelligence, focusing on the ability to solve novel problems, rather than relying on memorized patterns. So with the o3 model, OpenAI has indeed achieved a historic breakthrough in generalized intelligence. It may bring OpenAI closer to achieving AGI (Artificial General Intelligence) -- an AI system that can match or exceed human intelligence. Besides ARC-AGI, OpenAI o3 scored 71.7% on SWE-bench Verified, 2,727 on Codeforces, 96.7% on AIME 2024, and 87.7% on GPQA Diamond. All these tests are highly challenging and the scores are significantly higher than what o1 achieved. Finally, on the EpochAI Frontier Math benchmark, whose problems take expert mathematicians hours to solve, OpenAI o3 reached 25.2% accuracy. The earlier best score was just 2.0%. Coming to the o3-mini model, OpenAI says it's a distilled model from o3, and optimized for coding, fast performance, and cost-efficiency. o3-mini has three compute settings: low, medium, and high. At the medium setting, o3-mini outperforms the larger o1 model and costs less. Its latency is also lower than the o1 model. In case you are wondering why it is called o3, and not o2, well, to avoid legal issues with O2, the UK-based mobile network operator, OpenAI decided to skip o2 altogether. 
Finally, about availability, OpenAI says it's performing safety testing on o3 and o3-mini models. The company is also opening up the o3-mini model for public safety testing. OpenAI plans to release the o3-mini model by the end of January 2025. And after that, the o3 model will be released, after rigorous testing and approval by regulators.

[14]

Geeky Gadgets

OpenAI o3 and o3-mini Introduced - 12 Days of OpenAI: Day 12

On its 12th Day of OpenAI and its final announcement OpenAI has launched two advanced AI models, o3 and o3-mini, aimed at transforming reasoning capabilities while maintaining a strong focus on safety and cost efficiency. These models address critical challenges in artificial intelligence by combining innovative performance with a commitment to responsible deployment. Through rigorous testing and alignment strategies, OpenAI ensures these models meet the highest standards of safety and reliability. These models don't just excel in technical benchmarks -- they come equipped with features designed to adapt to diverse needs, from resource-conscious applications to high-stakes problem-solving. And with a focus on public safety testing and a novel alignment strategy, OpenAI is inviting the community to help shape the future of AI deployment. The o3 model represents a major step forward in AI reasoning, excelling in complex domains such as coding, mathematics, and scientific problem-solving. Its counterpart, o3-mini, is a more compact and cost-efficient version, tailored for applications requiring flexibility and resource-conscious solutions. Both models are designed with a focus on precision and problem-solving, but o3-mini introduces an innovative feature: adjustable reasoning levels. This allows users to balance performance with resource efficiency, making it suitable for a wide range of use cases. These models stand out not only for their technical capabilities but also for their adaptability. The o3-mini model, in particular, offers a scalable solution for developers and organizations seeking to optimize costs without sacrificing performance. This dual approach ensures that both models cater to diverse needs, from high-stakes scientific research to everyday business applications. The o3 model has set new benchmarks in AI evaluation, achieving exceptional results across various technical assessments. 
These achievements underscore the models' ability to handle complex tasks with remarkable accuracy and efficiency. By excelling in these benchmarks, o3 and o3-mini establish themselves as state-of-the-art tools for reasoning and problem-solving, setting a new standard for AI performance. The o3-mini model is specifically designed for users seeking cost-effective AI solutions without compromising on quality. Its standout feature, adaptive reasoning, allows users to select reasoning levels -- low, medium, or high -- based on their specific needs. This customization ensures that the model can optimize resource usage while maintaining its effectiveness in solving problems. This adaptability makes o3-mini particularly appealing for developers and organizations with varying requirements. Whether you need a lightweight solution for routine tasks or a more robust system for complex challenges, the o3-mini model provides the flexibility to meet those demands. By offering a scalable approach to AI reasoning, OpenAI ensures that its technology remains accessible and practical for a broad audience. OpenAI has introduced a novel safety strategy called "deliberative alignment," using the reasoning capabilities of o3 and o3-mini to enhance their safety protocols. This approach focuses on identifying unsafe prompts and establishing clear boundaries to prevent misuse. A key component of this initiative is public safety testing, where researchers are invited to evaluate the models' behavior and provide feedback. Applications for participation in this program are open until January 10, reflecting OpenAI's commitment to transparency and collaboration. By involving the research community, OpenAI aims to refine its models and ensure they operate within safe and ethical guidelines. 
This proactive approach underscores the importance of safety in the development and deployment of advanced AI systems. Both o3 and o3-mini come equipped with advanced API functionalities designed to streamline integration into various applications. These features make it easier for developers to incorporate the models into diverse workflows, from software development to data analysis. By focusing on usability, OpenAI aims to expand the practical applications of its AI technologies, making sure they are accessible to a wider audience. This emphasis on seamless integration highlights the versatility of o3 and o3-mini in addressing real-world challenges. OpenAI is actively collaborating with the ARC Prize Foundation and other research organizations to develop robust benchmarks for evaluating AI progress. These partnerships aim to establish rigorous standards that ensure advancements in AI are both measurable and meaningful. By working with leading experts, OpenAI seeks to drive innovation while maintaining accountability and transparency. This collaborative effort reflects OpenAI's broader mission to foster a responsible AI ecosystem. By creating benchmarks that are both challenging and fair, the organization ensures that future developments in AI are aligned with societal needs and expectations. These partnerships also encourage the exchange of ideas, promoting a culture of innovation and shared responsibility. The o3-mini model is set for release by the end of January, with the full o3 model following shortly thereafter. OpenAI plans to continue working closely with researchers and developers to refine these models and address emerging challenges. By prioritizing safety, performance, and accessibility, OpenAI reinforces its commitment to responsible AI innovation. Are you a researcher or developer passionate about shaping the future of AI? OpenAI invites you to participate in its public safety testing program. 
Your insights will play a vital role in evaluating and improving these models, making sure they are both effective and secure. Applications are open now -- your contributions could help define the next chapter in AI development.

[15]

New Scientist

OpenAI's o3 model aced a test of AI reasoning - but it's still not AGI

The latest AI model from OpenAI achieved a new record score on the $1 million ARC prize - but that doesn't mean it has reached the level of artificial general intelligence. OpenAI's new o3 artificial intelligence model has achieved a breakthrough high score on a prestigious AI reasoning test called the ARC Challenge, inspiring some AI fans to speculate that o3 has achieved artificial general intelligence (AGI). But even as ARC Challenge organisers described o3's achievement as a major milestone, they also cautioned that it has not won the competition's grand prize - and it is only one step on the path toward AGI, a term for hypothetical future AI with human-like intelligence. The o3 model is the latest in a line of AI releases that follow on from the large language models powering ChatGPT. "This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models," said François Chollet, an engineer at Google and the main creator of the ARC Challenge, in a blog post. Chollet designed the Abstraction and Reasoning Corpus (ARC) Challenge in 2019 to test how well AIs can find correct patterns linking pairs of coloured grids. Such visual puzzles are intended to make AIs demonstrate a form of general intelligence with basic reasoning capabilities. But throwing enough computing power at the puzzles could let even a non-reasoning program simply solve them through brute force. To prevent this, the competition also requires official score submissions to meet certain limits on computing power. OpenAI's newly announced o3 model - which is scheduled for release in early 2025 - achieved its official breakthrough score of 75.7 per cent on the ARC Challenge's "semi-private" test, which is used for ranking competitors on a public leaderboard. The computing cost of its achievement was approximately $20 for each visual puzzle task, meeting the competition's limit of less than $10,000 total. 
However, the harder "private" test that is used to determine grand prize winners has an even more stringent computing power limit, equivalent to spending just 10 cents on each task, which OpenAI did not meet. The o3 model also achieved an unofficial score of 87.5 per cent by applying approximately 172 times more computing power than it did on the official score. For comparison, the typical human score is 84 per cent, and an 85 per cent score is enough to win the ARC Challenge's $500,000 grand prize - if the model can also keep its computing costs within the required limits. But to reach its unofficial score, o3's cost soared to thousands of dollars spent solving each task. OpenAI requested that the challenge organisers not publish the exact computing costs. The ARC Challenge organisers have specifically said they do not consider beating this competition benchmark to be an indicator of having achieved AGI. The o3 model also failed to solve more than 100 visual puzzle tasks, even when OpenAI applied a very large amount of computing power toward the unofficial score, said Mike Knoop, an ARC Challenge organiser at software company Zapier, in a social media post on X. In a social media post on Bluesky, Melanie Mitchell at the Santa Fe Institute in New Mexico said the following about o3's progress on the ARC benchmark: "I think solving these tasks by brute-force compute defeats the original purpose." "While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI - there's still a fair number of very easy [ARC Challenge] tasks that o3 can't solve," said Chollet in another X post. However, Chollet described how we might know when human-level intelligence has been demonstrated by some form of AGI. "You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible," he said. 
Thomas Dietterich at Oregon State University suggests another way to recognise AGI. "Those architectures claim to include all of the functional components required for human cognition," he says. "By this measure, the commercial AI systems are missing episodic memory, planning, logical reasoning, and most importantly, meta-cognition." The o3 model's high score comes as the tech industry and AI researchers have been reckoning with a slower pace of progress in the latest AI models for 2024, compared with the initial explosive developments of 2023. Although it did not win the ARC Challenge, o3's high score indicates that AI models could beat the competition benchmark in the near future. Beyond its unofficial high score, Chollet says many official low-compute submissions have already scored above 81 per cent on the private evaluation test set. Dietterich also thinks that "this is a very impressive leap in performance". However, he cautions that, without knowing more about how OpenAI's o1 and o3 models work, it is impossible to evaluate just how impressive the high score is. For instance, if o3 was able to practice the ARC problems in advance, then that would make its achievement easier. "We will need to await an open source replication to understand the full significance of this," says Dietterich. The ARC Challenge organisers are already looking to launch a second and more difficult set of benchmark tests sometime in 2025. They will also keep the ARC Prize 2025 challenge running until someone achieves the grand prize and open sources their solution.
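The cost figures in the article above can be checked with a few lines of arithmetic. The sketch below assumes that cost scales linearly with compute, which the article does not state and which OpenAI has not confirmed. All inputs are the approximate values quoted above, and the function name is hypothetical.

```python
# Approximate figures as reported in the article above.
LOW_COMPUTE_COST_PER_TASK_USD = 20.0   # official 75.7 per cent run
COMPUTE_MULTIPLIER = 172               # extra compute for the unofficial run
OFFICIAL_SCORE = 75.7
UNOFFICIAL_SCORE = 87.5
PRIVATE_TEST_LIMIT_PER_TASK_USD = 0.10  # "10 cents on each task"

def estimate_cost_per_task(base_cost_usd, multiplier):
    """Estimate per-task cost, ASSUMING cost grows linearly with compute."""
    return base_cost_usd * multiplier

high_cost = estimate_cost_per_task(LOW_COMPUTE_COST_PER_TASK_USD, COMPUTE_MULTIPLIER)
score_gain = round(UNOFFICIAL_SCORE - OFFICIAL_SCORE, 1)
over_limit = round(LOW_COMPUTE_COST_PER_TASK_USD / PRIVATE_TEST_LIMIT_PER_TASK_USD)

print(f"Estimated high-compute cost per task: ${high_cost:,.0f}")
print(f"Score gain: {score_gain} percentage points")
print(f"Official run vs private-test limit: {over_limit}x over budget")
```

Under that linear assumption, the estimate lands in the "thousands of dollars" per task range the article describes, for a gain of about 12 percentage points.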

[16]

SiliconANGLE

OpenAI details o3 reasoning model with record-breaking benchmark scores - SiliconANGLE

OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model's introduction caps off a 12-day product announcement series that started with the launch of a new ChatGPT plan. ChatGPT Pro, as the $200 per month subscription is called, features a predecessor of the new o3 LLM. OpenAI also released its Sora video generator and a bevy of smaller product updates. The company has not detailed how o3 works or when it might become available to customers. However, it did release results from a series of benchmarks that evaluated how well o3 performs various reasoning tasks. Compared with earlier LLMs, the model demonstrated significant improvements across the board. Perhaps the most notable of the benchmarks that OpenAI used is called ARC-AGI-1. It tests how well a neural network performs tasks that it was not specifically trained to perform. This kind of versatility is seen as a key requisite to creating artificial general intelligence, or AGI, a hypothetical future AI that can perform many tasks with the same accuracy as humans. Using a relatively limited amount of computing power, o3 scored 75.7% on ARC-AGI-1. That percentage grew to 87.5% when the model was given access to more infrastructure. GPT-3, the LLM that powered the original version of ChatGPT, scored 0% while the GPT-4o model released earlier this year managed 5%. "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," François Chollet, the developer of the benchmark, wrote in a blog post. "You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible." OpenAI says that o3 also achieved record-breaking performance in the Frontier Math test, one of the most difficult AI evaluation benchmarks on the market. 
It comprises several hundred advanced mathematical problems that were created with input from more than 60 mathematicians. OpenAI says that o3 solved 25.2% of the problems in the test, easily topping the previous high score of about 2%. Programming is another use case to which the LLM can be applied. According to OpenAI, o3 outperformed the previous-generation o1 model on the SWE-Bench Verified benchmark by 22.8 percentage points. The benchmark includes questions that challenge AI models to find and fix a bug in a code repository based on a natural language description of the problem. OpenAI detailed that o3 is available in two flavors: a full-featured edition called simply o3 and o3-mini. The latter release is presumably a lightweight version that trades off some output quality for faster response times and lower inference costs. The previous-generation o1 model is also available in such a scaled-down edition. Initially, OpenAI is only making o3 accessible to a limited number of AI safety and cybersecurity researchers. Their feedback will help the company improve the safety of the model before it's made more broadly available. In a blog post, OpenAI detailed that it built o3 using a new technique for preventing harmful output. The method, deliberative alignment, allows researchers to supply AI models with a set of human-written safety guidelines. It works by embedding those guidelines into the training dataset with which an LLM is developed.

[17]

Ars Technica

OpenAI announces o3 and o3-mini, its next simulated reasoning models

On Friday, during Day 12 of its "12 days of OpenAI," OpenAI CEO Sam Altman announced its latest AI "reasoning" models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today. The models use what OpenAI calls "private chain of thought," where the model pauses to examine its internal dialog and plan ahead before responding, which you might call "simulated reasoning" (SR) -- a form of AI that goes beyond basic large language models (LLMs). The company named the model family "o3" instead of "o2" to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday's livestream, Altman acknowledged his company's naming foibles, saying, "In the grand tradition of OpenAI being really, truly bad at names, it'll be called o3." According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent -- comparable to human performance at an 85 percent threshold. OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.

[18]

FoneArena

OpenAI unveils o3 and o3 mini AI reasoning models

OpenAI marked the conclusion of its "12 Days of OpenAI" announcements with the introduction of two advanced reasoning models: o3 and o3 mini. These models succeed the earlier o1 reasoning model, released earlier this year. Interestingly, OpenAI skipped "o2" to avoid potential conflicts or confusion with the British telecom company O2. The o3 model establishes a new benchmark for reasoning and intelligence, outperforming its predecessor across various domains. The ARC-AGI benchmark evaluates generalized intelligence by testing a model's ability to solve novel problems without relying on memorized patterns. With this achievement, OpenAI describes the o3 model as a significant step toward Artificial General Intelligence (AGI). The o3 mini offers a distilled version of o3, optimized for efficiency and affordability. OpenAI introduced deliberative alignment, a novel training paradigm aimed at improving safety by incorporating structured reasoning aligned with human-written safety standards. Deliberative alignment employs both process-based and outcome-based supervision. The first version of the o3 model will be released in early 2025. OpenAI has invited safety and security researchers to apply for early access, with applications closing on January 10, 2025. Selected researchers will be notified shortly after. OpenAI continues to prioritize safety research as reasoning models become increasingly sophisticated. This initiative aligns with its ongoing collaborations with organizations such as the U.S. and UK AI Safety Institutes, ensuring advancements in AI remain secure and beneficial.

[19]

TechRadar

12 Days of OpenAI ends with a new model for the new year

Both models handily outperform existing AI models and will roll out in the next few months. The final day of the 12 Days of OpenAI brought back OpenAI CEO Sam Altman to show off a brand new set of AI models coming in the new year. The o3 and o3-mini models are enhanced versions of the relatively new o1 and o1-mini models. They're designed to think before they speak, reasoning out their answers. The mini version is smaller and aimed more at carrying out a limited set of specific tasks but with the same approach. OpenAI is calling it a big step toward artificial general intelligence (AGI), which is a pretty bold claim for what is, in some ways, a mild improvement to an already powerful model. You might have noticed there's a number missing between the current o1 and the upcoming o3 model. According to Altman, that's because OpenAI wants to avoid any confusion with British telecom company O2. So, what makes o3 special? Unlike regular AI models that spit out answers quickly, o3 takes a beat to reason things out. This "private chain of thought" lets the model fact-check itself before responding, which helps it avoid some of the classic AI pitfalls, like confidently spewing out wrong answers. This extra thinking time can make o3 slower, even if only a little bit, but the payoff is better accuracy, especially in areas like math, science, and coding. One great aspect of the new models is that you can adjust that extra thinking time manually. If you're in a hurry, you can set it to "low compute" for quick responses. But if you want top-notch reasoning, crank it up to "high compute" and give it a little more time to mull things over. In tests, o3 has easily outstripped its predecessor. This is not quite AGI; o3 can't take over for humans in every way. It also does not reach OpenAI's definition of AGI, which describes models that outperform humans in the most economically valuable projects. 
Still, should OpenAI reach that goal, things get interesting for its partnership with Microsoft since that would end OpenAI's obligation to give Microsoft exclusive access to the most advanced AI models. Right now, o3 and its mini counterpart aren't available to everyone. OpenAI is giving safety researchers a sneak peek via Copilot Labs, and the rest of us can expect the o3-mini model to drop in late January, with the full o3 following soon after. It's a careful, measured rollout, which makes sense given the kind of power and complexity we're talking about here. Still, o3 gives us a glimpse of where things are headed: AI that doesn't just generate content but actually thinks through problems. Whether it gets us to AGI or not, it's clear that smarter, reasoning-driven AI is the next frontier. For now, we'll just have to wait and see if o3 lives up to the hype or if this last gift from OpenAI is just a disguised lump of coal.

[20]

TechCrunch

OpenAI's o3 suggests AI models are scaling in new ways -- but so are the costs | TechCrunch

Last month, AI founders and investors told TechCrunch that we're now in the "second era of scaling laws," noting how established methods of improving AI models were showing diminishing returns. One promising new method they suggested could keep gains was "test-time scaling," which seems to be what's behind the performance of OpenAI's o3 model - but it comes with drawbacks of its own. Much of the AI world took the announcement of OpenAI's o3 model as proof that AI scaling progress has not "hit a wall." The o3 model does well on benchmarks, significantly outscoring all other models on a test of general ability called ARC-AGI, and scoring 25% on a difficult math test that no other AI model scored more than 2% on. Of course, we at TechCrunch are taking all this with a grain of salt until we can test o3 for ourselves (very few have tried it so far). But even before o3's release, the AI world is already convinced that something big has shifted. The co-creator of OpenAI's o-series of models, Noam Brown, noted on Friday that the startup is announcing o3's impressive gains just three months after the startup announced o1 - a relatively short timeframe for such a jump in performance. "We have every reason to believe this trajectory will continue," said Brown in a tweet. Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI "progress will be faster in 2025 than in 2024." (Keep in mind that it benefits Anthropic - especially its ability to raise capital - to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.) Next year, Clark says the AI world will splice together test-time scaling and traditional pre-training scaling methods to eke even more returns out of AI models. Perhaps he's suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just like Google did last week. 
Test-time scaling means OpenAI is using more compute during ChatGPT's inference phase, the period of time after you press enter on a prompt. It's not clear exactly what is happening behind the scenes: OpenAI is either using more computer chips to answer a user's question, running more powerful inference chips, or running those chips for longer periods of time - 10 to 15 minutes in some cases - before the AI produces an answer. We don't know all the details of how o3 was made, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models. While o3 may give some a renewed belief in the progress of AI scaling laws, OpenAI's newest model also uses a previously unseen level of compute, which means a higher price per answer. "Perhaps the only important caveat here is understanding that one reason why O3 is so much better is that it costs more money to run at inference time - the ability to utilize test-time compute means on some problems you can turn compute into a better answer," Clark writes in his blog. "This is interesting because it has made the costs of running AI systems somewhat less predictable - previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output." Clark, and others, pointed to o3's performance on the ARC-AGI benchmark - a difficult test used to assess breakthroughs on AGI - as an indicator of its progress. It's worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI, but rather it's one way to measure progress towards the nebulous goal. That said, the o3 model blew past the scores of all previous AI models which had done the test, scoring 88% in one of its attempts. OpenAI's next best AI model, o1, scored just 32%. But the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1000 worth of compute for every task. 
The o1 models used around $5 of compute per task, and o1-mini used just a few cents. The creator of the ARC-AGI benchmark, François Chollet, writes in a blog that OpenAI used roughly 170x more compute to generate that 88% score, compared to the high-efficiency version of o3 that scored just 12% lower. The high-scoring version of o3 used more than $10,000 of resources to complete the test, which makes it too expensive to compete for the ARC Prize - an unbeaten competition for AI models to beat the ARC test. However, Chollet says o3 was still a breakthrough for AI models, nonetheless. "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain," said Chollet in the blog. "Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy." It's premature to harp on the exact pricing of all this - we've seen prices for AI models plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these prices indicate just how much compute is required to break, even slightly, the performance barriers set by leading AI models today. This raises some questions. What is o3 actually for? And how much more compute is necessary to make more gains around inference with o4, o5, or whatever else OpenAI names its next reasoning models? It doesn't seem like o3, or its successors, would be anyone's "daily driver" like GPT-4o or Google Search might be. These models just use too much compute to answer small questions throughout your day such as, "How can the Cleveland Browns still make the 2024 playoffs?" Instead, it seems like AI models with scaled test-time compute may only be good for big picture prompts such as, "How can the Cleveland Browns become a Super Bowl franchise in 2027?" 
Even then, maybe it's only worth the high compute costs if you're the general manager of the Cleveland Browns, and you're using these tools to make some big decisions. Institutions with deep pockets may be the only ones who can afford o3, at least to start, as Wharton professor Ethan Mollick notes in a tweet. We've already seen OpenAI release a $200 tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it. But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails on some very easy tasks that a human would do quite easily. This isn't necessarily surprising, as large language models still have a huge hallucination problem, which o3 and test-time compute don't seem to have solved. That's why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust answers at face value. Presumably AGI, should it ever be reached, would not need such a disclaimer. One way to unlock more gains in test-time scaling could be better AI inference chips. There's no shortage of startups tackling just this thing, such as Groq or Cerebras, while other startups are designing more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch he expects these startups to play a bigger role in test-time scaling moving forward. While o3 is a notable improvement to the performance of AI models, it raises several new questions around usage and costs. That said, the performance of o3 does add credence to the claim that test-time compute is the tech industry's next best way to scale AI models.

[21]

Mashable

OpenAI announces o3 and o3 mini reasoning models

In the livestream, SVP of Research Mark Chen showed o3's performance on certain benchmarks, compared to o1, like competition math (96.7 percent) and PhD-level science (87.7 percent). OpenAI and the ARC Prize competition also shared how o3 scored 76 percent on the ARC-AGI benchmark, which includes novel unpublished datasets. The ARC-AGI benchmark is designed to test ability to learn new and distinct skills on the fly with every new task. The announcement caps the 12 Days of OpenAI marathon, which debuted something new every day. Over the past 12 business days, OpenAI has launched its AI video generator Sora, vision with Advanced Voice Mode, in addition to a slew of products and features designed to make ChatGPT more seamless to use in work and daily life. The o3 mini model is designed to be a cost-efficient model that balances performance. It has three different effort levels and can adapt its amount of reasoning time based on the difficulty of the problem. "An incredible cost-to-performance gain," said CEO Sam Altman. So, o3 and o3 mini have achieved amazing intelligence breakthroughs according to OpenAI. But they're not ready to be released to the public yet. But OpenAI is granting early access to o3 and o3 mini for safety testing starting today. Applications to join the model testing program are accepted on a rolling basis and close on Jan. 10.

[22]

Interesting Engineering

o3 and o3 mini: OpenAI's new reasoning AI models enter trial phase

The o3 model scored 96.7% accuracy on the AIME 2024 math competition, missing only one question, and achieved 87.7% on GPQA Diamond for scientific reasoning, outperforming typical PhD-level experts at 70%. A standout achievement for o3 was solving 25.2% of problems on EpochAI's Frontier Math benchmark, a huge jump from the previous model's 2% accuracy. It also scored 87.5% on the ARC-AGI benchmark, surpassing human performance in conceptual reasoning. A post on X said that "OpenAI o3 ranks 2727 on Codeforces, equivalent to the #175 best human competitive coder in the world", marking an absolutely superhuman achievement for AI and technology. Furthermore, the o3-mini is a streamlined version of o3, designed for efficiency in coding tasks. It offers strong performance with lower computational costs and adjustable reasoning settings -- low, medium, and high -- for flexibility across various tasks. The company also introduced a new safety method called deliberative alignment, using the models' reasoning skills to better identify and manage unsafe prompts. This marks a major advancement in AI safety, improving accuracy in rejecting harmful requests while avoiding the over-refusal of valid ones.

[23]

Gizmodo

OpenAI Skips o2 and Debuts New o3 'Reasoning' Model

Reasoning models are supposed to fact-check themselves by producing a step-by-step plan to find a correct answer. The final day of OpenAI's "12 Days of Shipmas" has arrived with the unveiling of o3, a new chain-of-thought "reasoning" model that the company claims is its most advanced yet. The model is not yet available for general use, but safety researchers can sign up for a preview starting today. OpenAI and others hope that reasoning models will go a long way toward solving the pernicious problem of chatbots frequently producing wrong answers. Chatbots fundamentally do not "think" like humans and different techniques are needed to try and create the best simulacrum of a human thought process. When asked a question, reasoning models pause and consider related prompts that could help produce an accurate answer. For example, if you ask the o3 model, "can habaneros be grown in the Pacific Northwest," the model might lay out a series of questions it will research to come to a conclusion, such as "where do habaneros typically grow," "what are the ideal conditions for growing habaneros," and "what type of climate does the Pacific Northwest have." Anyone who has used chatbots knows you sometimes have to prompt a chatbot with additional follow-ups until it finally gets the right result. Reasoning models are supposed to do this additional work for you. o3 is the successor to o1, OpenAI's first chain-of-thought reasoning model. Reps said they decided to skip the "o2" naming convention "out of respect" for the British telecommunications company, but it certainly doesn't hurt that it makes the product sound more advanced. The company says the new model comes with the ability to adjust its reasoning time. Users can choose low, medium, or high reasoning time; the greater the compute, the better o3 is supposed to perform. 
OpenAI says it will spend time "red-teaming" the new model with researchers to prevent it from producing potentially harmful responses (since again, it is not a human and does not know right versus wrong). Reasoning is the buzzword of the day in the field of generative AI, as industry insiders believe it is the next unlock necessary to improve the performance of large language models. More compute eventually does not offer equivalent performance gains, so new techniques are needed. Google DeepMind recently unveiled its own reasoning model called Gemini Deep Research, which can take 5-10 minutes to generate a report that analyzes many sources across the web in order to come to its findings. OpenAI is confident in o3, and offers impressive benchmarks -- it says that in Codeforces testing, which measures coding ability, o3 got a score of 2727. For context, a score of 2400 would put an engineer in the 99th percentile of programmers. It gets a score of 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question. We will have to see how the model holds up in real-world testing, and it is still generally not a good idea to rely too much on AI models for important work where accuracy is necessary. But optimists are confident that the problem of accuracy is being solved. Hopefully so, because as it stands, Google's AI Overviews in search are still the subject of frequent social media ridicule. AI model companies like OpenAI and Perplexity are in a race to become the next Google, collecting the world's knowledge and helping users make sense of it all. They even have search products now that are meant to more directly replicate Google with access to real-time web results. All of these players seem to leapfrog one another with every passing day, however. 
The feeling is somewhat reminiscent of the late '90s when there were a myriad of search engines to choose from -- Google, Yahoo, AltaVista, and Ask Jeeves, just to name a few, all hoovering up the internet's data and presenting it just with a different UX. Most of them disappeared after one came along that was supremely better than the rest -- Google. OpenAI clearly has a strong lead right now with hundreds of millions of monthly active users and a partnership with Apple, but Google has received a lot of plaudits recently for advancements in its Gemini models. The Verge reports that the company is going to soon integrate Gemini more deeply into its search interface.

[24]

Gadgets 360

OpenAI Unveils 'o3' Reasoning AI Models in Test Phase

OpenAI is testing new o3 and o3 mini AI models. These AI models can tackle complex problems. OpenAI released its first o1 model in September.

OpenAI said on Friday it was testing new reasoning AI models, o3 and o3 mini, in a sign of growing competition with rivals such as Google to create smarter models capable of tackling complex problems. CEO Sam Altman said the AI startup plans to launch o3 mini by the end of January, and full o3 after that, as more robust large language models could outperform existing models and attract new investments and users. Microsoft-backed OpenAI released o1 AI models in September designed to spend more time processing queries to solve hard problems. The o1 models are capable of reasoning through complex tasks and can solve more challenging problems than previous models in science, coding and math, the AI firm had said in a blog post. OpenAI's new o3 and o3 mini models, which are in internal safety testing currently, will be more powerful than its previously launched o1 models, the company said. The GenAI pioneer said it was opening up an application process for external researchers to test o3 models ahead of the public release, which will close on Jan. 10. OpenAI had triggered an AI arms race after it launched ChatGPT in November 2022. The growing popularity of the company and new product launches helped OpenAI in closing a $6.6 billion funding round in October. Rival Alphabet's Google released the second generation of its AI model Gemini earlier in December, as the search giant aims to reclaim the lead in the AI technology race.

[25]

Quartz

OpenAI unveils o3, its next 'reasoning' model

OpenAI ended its "12 Days of OpenAI" product-launch spree by unveiling the successor to its first "reasoning" model. The new frontier model family includes o3 and o3-mini, the artificial intelligence startup said Friday. Neither model is being publicly launched yet, but they are now available for public safety testing. "We view this as sort of the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning," OpenAI chief executive Sam Altman said during a livestreamed announcement. The AI startup is skipping the o2 name, Altman said, "out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being really, truly bad at names." O2, a brand of Spain's Telefónica, is a mobile network operator in the U.K. For the first time, OpenAI is opening the models for external safety testing. Safety and security researchers can sign up to preview and test the models, Altman said, adding that the startup plans to launch o3-mini around the end of January, followed by the full o3 model shortly after. Compared to o1 and o1-mini, which launched in September, o3 outperformed o1 by almost 23 percentage points on OpenAI's own SWE-Bench Verified evaluation, and reached a Codeforces rating of 2727, it said. Meanwhile, OpenAI's chief scientist scored 2665, according to the startup. The new model also set a record on EpochAI's Frontier Math evaluation, OpenAI said, and apparently more than tripled o1's score on the ARC-AGI test. OpenAI launched the full version of its o1 model out of preview during the first day of its "12 Days of OpenAI" promotional scheme. The startup also announced a new, $200-a-month subscription tier for ChatGPT called ChatGPT Pro, which includes a more advanced version of o1 called o1 pro mode.

[26]

Wired

OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills

OpenAI today announced an improved version of its most capable artificial intelligence model to date -- one that takes even more time to deliberate over questions -- just a day after Google announced its first model of this type. OpenAI's new model, called o3, replaces o1, which the company introduced in September. Like o1, the new model spends time ruminating over a problem in order to deliver better answers to questions that require step-by-step logical reasoning. The o3 model scores much higher on several measures than its predecessor, OpenAI says, including ones that measure complex coding-related skills and advanced math and science competency. It is three times better than o1 at answering questions posed by ARC-AGI, a benchmark designed to test an AI model's ability to reason over problems it is encountering for the first time. Google is pursuing a similar line of research. Noam Shazeer, a Google researcher, yesterday revealed in a post on X that the company has developed its own reasoning model, called Gemini 2.0 Flash Thinking. Google's CEO, Sundar Pichai, called it "our most thoughtful model yet" in his own post. The two dueling models show competition between OpenAI and Google to be fiercer than ever. It is crucial for OpenAI to demonstrate that it can keep making advances as it seeks to attract more investment and build a profitable business. Google is meanwhile desperate to show that it remains at the forefront of AI research. The new models also show how AI companies are increasingly looking beyond simply scaling up AI models in order to wring greater intelligence out of them. Large language models can answer many questions remarkably well, but they often stumble when asked to solve puzzles that require basic math or logic. OpenAI's o1 incorporates training on step-by-step problem-solving that makes an AI model better able to tackle these types of problems. 
Models that reason over problems will also be important as companies seek to deploy so-called AI agents that can reliably figure out how to solve complex problems on a user's behalf. The o3 model is 20 percent better than o1 at SWE-Bench, a test that measures a model's agentic abilities. While a true breakthrough moment has eluded tech giants at the end of the year, the pace of AI announcements has been dizzying of late. Early this month Google announced a new version of its flagship model, called Gemini 2.0, and demonstrated it as a web browsing helper and as an assistant that sees the world through a smartphone or a pair of smart glasses. OpenAI has made numerous announcements in the run up to Christmas, including a new version of its video-generating model, a free version of its ChatGPT-powered search engine, and a way to access ChatGPT over the phone by calling 1-800-ChatGPT.

[27]

Reuters

OpenAI unveils 'o3' reasoning AI models in test phase

Dec 20 (Reuters) - OpenAI said on Friday it was testing new reasoning AI models, o3 and o3 mini, in a sign of growing competition with rivals such as Google to create smarter models capable of tackling complex problems. CEO Sam Altman said the AI startup plans to launch o3 mini by the end of January, and full o3 after that, as more robust large language models could outperform existing models and attract new investments and users. Microsoft-backed OpenAI released o1 AI models in September designed to spend more time processing queries to solve hard problems. The o1 models are capable of reasoning through complex tasks and can solve more challenging problems than previous models in science, coding and math, the AI firm had said in a blog post. OpenAI's new o3 and o3 mini models, which are in internal safety testing currently, will be more powerful than its previously launched o1 models, the company said. The GenAI pioneer said it was opening up an application process for external researchers to test o3 models ahead of the public release, which will close on Jan. 10. OpenAI had triggered an AI arms race after it launched ChatGPT in November 2022. The growing popularity of the company and new product launches helped OpenAI in closing a $6.6 billion funding round in October. Rival Alphabet's Google released the second generation of its AI model Gemini earlier in December, as the search giant aims to reclaim the lead in the AI technology race. Reporting by Jaspreet Singh in Bengaluru; Editing by Vijay Kishore

[28]

Market Screener

OpenAI unveils 'o3' reasoning AI models in test phase

(Reuters) - OpenAI said on Friday it was testing new reasoning AI models, o3 and o3 mini, in a sign of growing competition with rivals such as Google to create smarter models capable of tackling complex problems. CEO Sam Altman said the AI startup plans to launch o3 mini by the end of January, and full o3 after that, as more robust large language models could outperform existing models and attract new investments and users. Microsoft-backed OpenAI released o1 AI models in September designed to spend more time processing queries to solve hard problems. The o1 models are capable of reasoning through complex tasks and can solve more challenging problems than previous models in science, coding and math, the AI firm had said in a blog post. OpenAI's new o3 and o3 mini models, which are in internal safety testing currently, will be more powerful than its previously launched o1 models, the company said. The GenAI pioneer said it was opening up an application process for external researchers to test o3 models ahead of the public release, which will close on Jan. 10. OpenAI had triggered an AI arms race after it launched ChatGPT in November 2022. The growing popularity of the company and new product launches helped OpenAI in closing a $6.6 billion funding round in October. Rival Alphabet's Google released the second generation of its AI model Gemini earlier in December, as the search giant aims to reclaim the lead in the AI technology race. (Reporting by Jaspreet Singh in Bengaluru; Editing by Vijay Kishore)

[29]

Tom's Guide

OpenAI's next generation model could be announced today -- here's what we know

OpenAI could be about to announce its most powerful AI model to date on the final day of its 12 Days of OpenAI extravaganza. The new model is the next generation "reasoner", replacing o1, released last week. A report from The Information suggests it could be called o3, skipping o2 as that is the name of a telecom company in the UK. This was backed up by a cryptic post on X from OpenAI CEO Sam Altman, stating "Should have said oh oh oh." We don't know anything about the new model but some experts suggest it could be capable of solving evaluation tests designed to determine whether a model is Artificial General Intelligence (AGI), something all AI labs are working towards. Declaring a model as AGI (something Altman said would be achieved in 2025) would allow OpenAI to terminate its long-standing contract with Microsoft and renegotiate more favorable terms with the tech giant. We will find out at 1pm ET (6pm GMT, 5am ACT) whether this is one giant leap for AI or a small step for ChatGPT. This new announcement could be a significant step forward in AI reasoning, an area where OpenAI's o series has already distinguished itself. However, even there it is facing competition from Google -- with its new Gemini Flash Thinking model. OpenAI's o series of models work slightly differently from the GPT family, such as GPT-4o. For example, o1 is focused on reasoning and problem-solving, using chain-of-thought to solve problems. In contrast, GPT-4o is trained to process and generate multiple modalities through a unified neural network. Part of what makes o1 different is the switch to focusing on post-training. OpenAI is using tools to improve how they handle specific tasks and learn from problems. This is known as reinforcement learning, something they confirmed would be available to developers during day 2 of the 12-Days event. Altman has previously hinted at a merging of the capabilities, thought to be the rumored "Orion" project. 
It isn't clear whether today's announcement will be that but with a new name, or whether o3 is simply a better version of o1. We seem to be skipping o2 completely due to the British telecom company of the same name. O2 even announced an AI project recently, using generative speech to keep scammers on the phone with a virtual old lady.

[30]

AIM

OpenAI Set to Release o3 Soon

OpenAI is reportedly developing its next-generation o1 reasoning model, which will take more time to "think" about user queries before responding. However, the company is considering skipping the name "o2" due to potential copyright or trademark conflicts with O2, a British telecommunications service provider. As a result, OpenAI is reportedly contemplating calling the update "o3," with some leaders already referring to the model internally as such. Recently, OpenAI chief Sam Altman posted on X, "Fine one clue, should have said oh oh oh," clearly hinting at the release of the next model. During the ongoing '12 Days of OpenAI,' the company released the full version of the o1 model. Besides this, it also released the o1 model in the API, upgraded with function calling, structured outputs, reasoning effort controls, developer messages, and vision inputs. Meanwhile, Google recently released Gemini 2.0 Flash Thinking. The new model comes with advanced reasoning capabilities, alongside showcasing its thoughts. Logan Kilpatrick, senior product manager at Google, said that the model "unlocks stronger reasoning capabilities and shows its thoughts". According to Kilpatrick, the model can "solve complex problems with Flash speeds" while displaying its internal planning process and allowing for greater transparency in AI problem-solving. The experimental model is still in its early stages. Kilpatrick, however, provided an example of its potential and showcased how it can tackle a challenging puzzle involving both visual and textual clues. Developers can try the model out in Google AI Studio and the Gemini API. "This is just the first step in our reasoning journey, [I'm] excited to see what you all think!" Kilpatrick added. With only one day remaining in OpenAI 'Shipmas', everyone is eagerly anticipating what OpenAI will unveil next to wrap up 12 days of nonstop shipping.

[31]

Benzinga

ChatGPT Maker OpenAI Drops o3 Reasoning Model As o1's Successor: Greg Brockman Calls It A 'Breakthrough' - Alphabet (NASDAQ:GOOG), Alphabet (NASDAQ:GOOGL)

On Friday, ChatGPT-maker OpenAI unveiled its next-generation reasoning model, o3, as the successor to o1, capping off the company's 12 days of shipmas announcements. What Happened: On the final day of its "12 Days of OpenAI" event, the AI startup revealed o3, which brings significant advancements in reasoning and coding capabilities. Alongside o3, Microsoft Corp.-backed OpenAI also introduced o3-mini, a smaller version designed for specific, more targeted applications. "We view this as the beginning of the next phase of AI. Where you can use these models to do increasingly complex tasks that require a lot of reasoning," stated OpenAI CEO Sam Altman. The company has begun rolling out o3 to select safety researchers, with plans for broader availability soon. The application window closes on Jan. 10. Altman also stated that the o3-mini may launch by the end of January, followed by the full release of the o3 model. Notably, OpenAI skipped over naming the new model "o2," to avoid potential trademark conflicts with British telecom provider O2, according to The Information. Why It Matters: OpenAI's progress is evident in the ARC-AGI benchmark, where o3 scored 87.5% on the high compute setting, tripling o1's performance on the lower setting. The ARC-AGI test is often used to measure progress toward achieving Artificial General Intelligence (AGI) by assessing a model's performance in areas that require advanced problem-solving and reasoning. However, ARC-AGI co-creator François Chollet also noted that o3 struggles with "very easy tasks" in ARC-AGI, suggesting it has "fundamental differences" from human intelligence. 
"While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI -- there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3," Chollet stated. As part of its "12 Days of OpenAI" event, the AI startup rolled out the ChatGPT Search engine to all users, including those on the free tier. The AI startup also unveiled its text-to-video model, Sora, and introduced a $200 monthly ChatGPT Pro subscription. Earlier this month, Alphabet Inc. also launched the second generation of its Gemini AI model to regain its position at the forefront of the AI technology race.

[32]

VentureBeat

OpenAI confirms new frontier models o3 and o3-mini

OpenAI has just confirmed that it is releasing new reasoning models named o3 and o3 mini, successors to the o1 and o1 mini models that just entered full release earlier this month. CEO Sam Altman just said they would be released to researchers for safety testing today in OpenAI's final day of "12 Days of OpenAI" livestreams. Altman also said the o3 model was "incredible at coding" and the benchmarks shared by OpenAI support it, exceeding even o1's performance on programming tasks.

[33]

Investing.com

OpenAI testing new AI models o3 and o3 mini amid competition

Investing.com -- OpenAI, the artificial intelligence firm, announced on Friday that it is currently testing its new reasoning AI models, named o3 and o3 mini. This development signals an intensifying competition in the AI industry, with rivals like Google also striving to develop more advanced models capable of handling complex problems. CEO Sam Altman shared the company's plan to launch the o3 mini model by the end of January, followed by the full o3 model. These larger, more robust language models are expected to outperform existing models, potentially attracting new investments and users. The company's previous o1 models have demonstrated the ability to reason through complex tasks and solve more challenging problems in areas such as science, coding, and math, according to a blog post by OpenAI. The upcoming o3 and o3 mini models, currently undergoing internal safety testing, are expected to be more powerful than the o1 models. In an effort to engage external researchers in the testing process, OpenAI is initiating an application process for those interested in testing the o3 models before their public release. The application process will close on January 10. OpenAI has been a key player in the AI industry since it launched ChatGPT in November 2022, triggering what could be seen as an AI arms race. The company's growing popularity and consistent product launches have been successful in attracting significant funding, closing a $6.6 billion funding round in October. Meanwhile, Alphabet's Google, another major competitor in the AI space, released the second generation of its AI model, Gemini, earlier in December, as part of its efforts to regain the lead in the AI technology race.

[34]

NYT

OpenAI Unveils New A.I. That Reasons Through Math, Science Problems

OpenAI on Friday unveiled a new artificial intelligence system, OpenAI o3, which is designed to "reason" through problems involving math, science and computer programming. The company said that the system, which it is currently sharing only with safety and security testers, outperformed the industry's leading A.I. technologies on standardized benchmark tests that rate skills in math, science, coding and logic. The new system is the successor to o1, the reasoning system that the company introduced earlier this year. OpenAI o3 was more accurate than o1 by over 20 percent in a series of common programming tasks, the company said, and it even outperformed its chief scientist, Jakub Pachocki, on a competitive programming test. OpenAI said it plans to roll the technology out to individuals and businesses early next year. "This model is incredible at programming," said Sam Altman, OpenAI's chief executive, during an online presentation to reveal the new system. He added that at least one OpenAI programmer could still beat the system on this test. The new technology is part of a wider effort to build A.I. systems that can reason through complex tasks. Earlier this week, Google unveiled similar technology, called Gemini 2.0 Flash Thinking Experimental, and shared it with a small number of testers. These two companies and others aim to build systems that can carefully and logically solve a problem through a series of steps, each one building on the last. These technologies could be useful to computer programmers who use A.I. systems to write code or to students seeking help from automated tutors in areas like math and science. With the debut of the ChatGPT chatbot in late 2022, OpenAI showed that machines could handle requests more like people, answering questions, writing term papers and generating computer code. But the responses were sometimes flawed. 
ChatGPT learned its skills by analyzing enormous amounts of text culled from across the internet, including news articles, books, computer programs and chat logs. By pinpointing patterns, it learned to generate text on its own. Because the internet is filled with untruthful information, the technology learned to repeat the same untruths. Sometimes, it made things up -- a phenomenon that scientists called "hallucination." OpenAI built its new system using what is called "reinforcement learning." Through this process, a system can learn behavior through extensive trial and error. By working through various math problems, for instance, it can learn which techniques lead to the right answer and which do not. If it repeats this process with a very large number of problems, it can identify patterns. Though systems like o3 are designed to reason, they are based on the same core technology as the original ChatGPT. That means they may still get things wrong or hallucinate. The system is designed to "think" through problems. It tries to break the problem down into pieces and look for ways to solve it, which can require much larger amounts of computing power than is needed for ordinary chatbots. That can also be expensive. Earlier this month, OpenAI began selling OpenAI o1 to individuals and businesses. One service, aimed at professionals, was priced at $200 a month. (The New York Times sued OpenAI and Microsoft in December, alleging copyright infringement of news content related to A.I. systems. The companies have denied the claims.)

[35]

Bloomberg

OpenAI Unveils More Advanced Reasoning Model in Race With Google

OpenAI is preparing to launch a new artificial intelligence model that it said is capable of more advanced human-like reasoning than its current offerings, ratcheting up the competition with rivals such as Alphabet Inc.'s Google. The new o3 model, which is set to be unveiled during a livestreamed event on Friday, spends more time computing an answer before responding to user queries with the goal of solving more complex multi-step problems. The company will also introduce a smaller version of the model called o3-mini.

OpenAI unveils o3 and o3 Mini models with impressive capabilities in reasoning, coding, and mathematics, sparking debate on progress towards Artificial General Intelligence (AGI).

OpenAI Unveils Groundbreaking o3 Models

OpenAI has introduced its latest AI models, o3 and o3 Mini, marking a significant advancement in artificial intelligence technology. These models demonstrate exceptional capabilities in reasoning, coding, and mathematics, often surpassing human performance in specialized domains [1].

Impressive Capabilities and Benchmarks

The o3 model has achieved remarkable results on various benchmarks:

Scored 75.7% on the ARC-AGI (Abstraction and Reasoning Corpus) benchmark in low-compute mode and 87.5% in high-compute mode, surpassing the 85% human-level performance threshold [4][5].
Attained 71.7% accuracy on SWE-Bench Verified, an improvement of more than 20 percentage points over its predecessor in software engineering tasks [5].
Achieved 25% accuracy on EpochAI's Frontier Math benchmark, a significant leap from the previous state of the art of 2% [5].
Reached a Codeforces rating of 2727, equivalent to roughly the 175th-best human competitive coder worldwide [5].

Key Features and Advancements

The o3 and o3 Mini models showcase several innovative features:

Chain of Thought reasoning: Enables breaking down complex problems into intermediate steps [2].
Self-evaluation capabilities: Allows the model to assess its own performance [3].
Adaptability to novel tasks: Demonstrates ability to solve unfamiliar problems [2].
Enhanced API integration: Improved functionalities for developers, including function calling and structured outputs [3][5].

Debate on AGI Progress

While the o3 models represent a significant leap in AI capabilities, experts remain divided on whether this constitutes true Artificial General Intelligence (AGI):

OpenAI CEO Sam Altman views this as "the beginning of the next phase of AI" [5].
François Chollet, creator of the ARC-AGI benchmark, argues that while impressive, o3 still falls short of AGI criteria [4][5].

Limitations and Challenges

Despite their achievements, the o3 models face several limitations:

High computational demands: Testing costs exceeded $300,000 in high-compute mode [2].
Inconsistent performance: Occasional struggles with simpler tasks [3].
Efficiency concerns: Need for optimization to reduce costs and improve accessibility [2][3].

Future Prospects and Industry Impact

The introduction of o3 and o3 Mini models has significant implications for the AI industry:

OpenAI plans to make these models available for public safety testing [5].
The rapid progress from o1 to o3 in just three months suggests accelerated development in AI capabilities [5].
Competing companies like Google, Anthropic, and Meta are expected to release their own advanced reasoning models [5].

As AI technology continues to evolve, the o3 models represent a crucial step towards more sophisticated and capable systems. However, challenges in efficiency, reliability, and defining AGI remain, highlighting the ongoing need for research and development in the field [1].

References

Summarized by Navi

[1] Geeky Gadgets, "New OpenAI o3 AI Model : A Giant Leap Toward Artificial General Intelligence (AGI)"

[2] Geeky Gadgets, "Revolutionary AI Model o3 Sparks AGI Debate - Are We There Yet?"

[3] Geeky Gadgets, "OpenAI's o3 and o3-Mini : Are We on the Brink of AGI?"

[4] Geeky Gadgets, "OpenAI Reveal They Achieved AGI - OpenAI o3"

[5] AIM, "OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI"
