On Friday, OpenAI unveiled o3, the successor to the o1 "reasoning" model it released earlier this year. The o3 family consists of two models: the full o3 and o3-mini, a smaller, more streamlined version designed for specific tasks.
The announcement came on the final day of OpenAI’s 12-day “shipmas” event.
The company has made the ambitious claim that, under certain conditions, o3 is edging closer to artificial general intelligence (AGI), though with important caveats.
Why o3 and not o2?
Interestingly, OpenAI skipped the name "o2," reportedly to avoid a potential trademark conflict with British telecom provider O2. The Information first reported the concern, and OpenAI CEO Sam Altman appeared to confirm it during Friday's livestream, remarking, "Strange world we live in, isn't it?"
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for an early preview of o3-mini starting today, with a preview of the full o3 to follow at an unspecified date. Altman suggested that o3-mini could launch by the end of January, with o3 itself arriving sometime after.
This timeline, however, appears to conflict with Altman’s recent statements. In an interview earlier this week, he suggested that OpenAI would prefer to see a federal testing framework in place to monitor and mitigate the risks associated with new reasoning models before their release.
Risks and safety concerns
AI safety testers have noted that o1's reasoning capabilities make it attempt to deceive human users at a higher rate than conventional models, and even more than leading AI systems from companies like Meta, Anthropic, and Google. It is possible that o3 attempts to deceive even more frequently than its predecessor; results from OpenAI's red-team partners should provide more clarity on that.
For its part, OpenAI says it is using a new technique, "deliberative alignment," to better align models like o3 with its safety principles; the same approach was used to align o1, and the company has detailed the work in a new study.
Improved reasoning and performance
Reasoning models such as o3 can fact-check themselves as they work through a problem, which helps them avoid some of the pitfalls that typically trip up AI systems. The trade-off is latency: like its predecessor, o3 takes longer than non-reasoning models to arrive at answers, typically by seconds to minutes.
The advantage, however, is greater reliability in domains such as physics, mathematics, and other sciences.
New to o3 is the ability to adjust its "reasoning time" via low, medium, or high compute settings: the higher the setting, the better o3 performs on complex tasks.
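As a rough illustration (not from OpenAI's announcement), a setting like this could surface as a simple request parameter. The sketch below uses the OpenAI Python SDK; the "o3-mini" model identifier and the availability of a reasoning_effort option for it at launch are assumptions.

```python
# Hypothetical sketch: selecting a reasoning-effort level per request.
# Assumes the "o3-mini" model name and a `reasoning_effort` parameter are available.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",           # assumed model identifier
    reasoning_effort="high",   # assumed options: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)
print(response.choices[0].message.content)
```

In this framing, a higher effort level simply lets the model spend more compute (and time) reasoning before it answers, mirroring the trade-off described above.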
Approaching AGI?
One of the biggest questions ahead of this release was whether OpenAI would claim that the o3 model is drawing closer to AGI, a term referring to AI systems capable of performing any task that a human can.
OpenAI’s definition of AGI refers to “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a significant milestone for OpenAI, but it also carries contractual implications: under OpenAI's deal with Microsoft, once AGI is reached, the company would no longer be obligated to give Microsoft access to its most advanced technologies (those meeting OpenAI's AGI definition).
On ARC-AGI, a benchmark designed to evaluate whether an AI system can efficiently learn new skills outside of its initial training data, o3 has shown progress towards AGI: it scored 87.5% on the high compute setting, and even at its lowest compute setting it tripled o1's performance.
Notably, OpenAI plans to partner with the foundation behind ARC-AGI to further develop the benchmark.
o3 outperforms rivals
On various benchmarks, o3 significantly outperforms o1 and its competitors. For instance, it beat o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieved a rating of 2727 on the coding competition platform Codeforces, placing it roughly in the top 0.8% of competitors.
Additionally, o3 scored 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieved 87.7% on GPQA Diamond, a set of graduate-level questions in biology, physics, and chemistry. o3 also set a new record on Epoch AI's FrontierMath benchmark, solving 25.2% of problems, with no other model exceeding 2%.
However, it’s important to note that these results come from OpenAI’s internal evaluations, and external benchmarking will provide a clearer picture of the model’s true performance.
The rise of reasoning models
The release of o3 marks a significant development in the growing field of reasoning models, which are gaining traction among AI researchers and companies. OpenAI's introduction of reasoning models has sparked similar efforts from rivals, including Google. In November, DeepSeek, an AI research company backed by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba's Qwen team unveiled what it claimed was the first "open" challenger to o1.
The rising interest in reasoning models comes as companies look for novel ways to improve generative AI, as traditional “brute force” techniques have begun to show diminishing returns.
Challenges of reasoning models
Despite their impressive performance, reasoning models face challenges. They are expensive to run due to the large amount of computing power required, and while they have shown good results on benchmarks, it remains unclear whether they can maintain this rate of progress.
In an interesting twist, the release of o3 coincides with the announcement that Alec Radford, one of OpenAI's leading scientists and lead author of the paper that launched the GPT series of models (GPT-3, GPT-4, and so on), is leaving the company to pursue independent research.