OpenAI announces new o3 models

OpenAI saved its biggest announcement for the last day of its 12-day “shipmas” event.

On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. o3 is a model family, to be more precise — as was the case with o1. There’s o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.

OpenAI makes the remarkable claim that o3, at least in certain conditions, approaches AGI — with significant caveats. More on that below.

o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK

— Greg Brockman (@gdb) December 20, 2024

Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn’t it?

Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime after; OpenAI didn’t specify when. Altman said that the plan is to launch o3-mini toward the end of January and follow with o3.

That conflicts a bit with his recent statements. In an interview this week, Altman said that, before OpenAI releases new reasoning models, he’d prefer a federal testing framework to guide monitoring and mitigating the risks of such models.

And there are risks. AI safety testers have found that o1’s reasoning abilities make it try to deceive human users at a higher rate than conventional, “non-reasoning” models — or, for that matter, leading AI models from Meta, Anthropic, and Google. It’s possible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll find out once OpenAI’s red-team partners release their testing results.

For what it’s worth, OpenAI says that it’s using a new technique, “deliberative alignment,” to align models like o3 with its safety principles. (o1 was aligned the same way.) The company has detailed its work in a new study.

Reasoning steps

Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.

This fact-checking process incurs some latency. o3, like o1 before it, takes a little longer — usually seconds to minutes longer — to arrive at solutions compared to a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science, and mathematics.

o3 was trained via reinforcement learning to “think” before responding via what OpenAI describes as a “private chain of thought.” The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution.

We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. pic.twitter.com/Ia0b63RXIk

— Noam Brown (@polynoamial) December 20, 2024

In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model the resulting model is very, very impressive. (2/n)
— Nat McAleese (@__nmca__) December 20, 2024

New in o3 compared with o1 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e., thinking time). The higher the compute, the better o3 performs on a task.

No matter how much compute they have at their disposal, however, reasoning models such as o3 aren’t flawless. While the reasoning component can reduce hallucinations and errors, it doesn’t eliminate them. o1 trips up on games of tic-tac-toe, for instance.

Benchmarks and AGI

One big question leading up to today was whether OpenAI might claim that its newest models are approaching AGI.

AGI, short for “artificial general intelligence,” broadly refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”

Claiming to have achieved AGI would be a bold declaration, and it would carry contractual weight for OpenAI as well. According to the terms of its deal with close partner and investor Microsoft, once OpenAI reaches AGI, it’s no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s AGI definition, that is).

Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 achieved an 87.5% score on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.

Granted, the high compute setting was exceedingly expensive: on the order of thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.

Today OpenAI announced o3, its next-gen reasoning model. We’ve worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.

It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA

— François Chollet (@fchollet) December 20, 2024

Chollet also pointed out that o3 fails on “very easy tasks” in ARC-AGI, indicating — in his opinion — that the model exhibits “fundamental differences” from human intelligence. He has previously noted the evaluation’s limitations, and cautioned against using it as a measure of AI superintelligence.

“[E]arly data points suggest that the upcoming [successor to the ARC-AGI] benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training),” Chollet continued in a statement. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

Incidentally, OpenAI says that it’ll partner with the foundation behind ARC-AGI to help it build the next generation of its AI benchmark, ARC-AGI 2.

On other tests, o3 blows away the competition.

The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating (another measure of coding skills) of 2727. (A rating of 2400 places an engineer at the 99.2nd percentile.) o3 scores 96.7% on the 2024 American Invitational Mathematics Examination, missing just one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 sets a new record on Epoch AI’s FrontierMath benchmark, solving 25.2% of problems; no other model exceeds 2%.
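For readers wondering how “missing just one question” squares with a 96.7% score, here is a rough arithmetic check. It assumes the result pools the two 2024 AIME papers, 15 questions each; the article does not say so, and OpenAI may have scored it differently. The function and variable names are illustrative only.

```python
# Assumption (not stated in the article): the 2024 AIME result pools the
# AIME I and AIME II papers, 15 questions each, for 30 questions in total.
QUESTIONS_PER_PAPER = 15
PAPERS = 2


def accuracy_percent(missed: int, total: int) -> float:
    """Return the share of questions answered correctly, as a percentage."""
    if not 0 <= missed <= total:
        raise ValueError("missed must be between 0 and total")
    return 100 * (total - missed) / total


total_questions = QUESTIONS_PER_PAPER * PAPERS
# One missed question out of 30, rounded to one decimal place.
aime_score = round(accuracy_percent(missed=1, total=total_questions), 1)
print(aime_score)
```

Under that assumption, one miss out of 30 rounds to the reported figure, whereas one miss on a single 15-question paper would come to roughly 93.3%.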

We trained o3-mini: both more capable than o1-mini, and around 4x faster end-to-end when accounting for reasoning tokens

with @ren_hongyu @shengjia_zhao & others pic.twitter.com/3Cujxy6yCU

— Kevin Lu (@_kevinlu) December 20, 2024

These claims have to be taken with a grain of salt, of course. They’re from OpenAI’s internal evaluations. We’ll need to wait to see how the model holds up to benchmarking from outside customers and organizations in the future.

A trend

In the wake of the release of OpenAI’s first series of reasoning models, there’s been an explosion of reasoning models from rival AI companies — including Google. In early November, DeepSeek, an AI research firm funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba’s Qwen team unveiled what it claimed was the first “open” challenger to o1 (in the sense that it could be downloaded, fine-tuned, and run locally).

What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As TechCrunch recently reported, “brute force” techniques to scale up models are no longer yielding the improvements they once did.

Not everyone’s convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they’ve performed well on benchmarks so far, it’s not clear whether reasoning models can maintain this rate of progress.

Interestingly, the release of o3 comes as one of OpenAI’s most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI’s “GPT series” of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he’s leaving to pursue independent research.


Maxwell Zeff

Senior Reporter, Consumer

Maxwell Zeff is a senior reporter at TechCrunch specializing in AI and emerging technologies. Previously with Gizmodo, Bloomberg, and MSNBC, Zeff has covered the rise of AI and the Silicon Valley Bank crisis. He is based in San Francisco. When not reporting, he can be found hiking, biking, and exploring the Bay Area’s food scene.

Kyle Wiggers

Senior Reporter, Enterprise

Kyle Wiggers is a senior reporter at TechCrunch with a special interest in artificial intelligence. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Brooklyn with his partner, a piano educator, and dabbles in piano himself occasionally, if mostly unsuccessfully.
