OpenAI's latest model, o1, isn't getting nearly as much buzz as its predecessors GPT-3 and GPT-4. But why?
Let's start with a real-world example. Here's a comparison between o1 and Claude converting a Figma design into code via Builder.io:
As you can see, o1 is significantly slower. It's also more expensive, and only sometimes better.
So why is OpenAI investing so heavily in o1? It all comes down to this:
Source: Alex Albert
Each new large language model (LLM) that comes out seems to be only incrementally better than the last. In fact, OpenAI's upcoming model, Orion, reportedly isn't even always better than GPT-4.
Why? We're simply running out of data to train on.
If we're hitting limits on how smart we can make AI models, what's the next move? Making them faster.
AI players are investing in specialized hardware to speed up inference and lower costs. They're building their own data centers and even looking into nuclear power for more sustainable, cheaper energy.
Companies like Groq and Cerebras have seen up to 10x performance increases with LLM-optimized hardware. This isn't just theoretical - Amazon has already released its new chips, and Apple is planning to use them.
Faster inference could open up new workflows that weren't feasible before due to long wait times and poor user experience.
But here's the million-dollar question: does being faster matter if AI can't get any smarter? Or to put it another way:
If the way we train models today isn't yielding better results - if AI is plateauing on intelligence - can we use increased speed and decreased cost to find another path to smarter outputs?
The answer might surprise you.
Let's borrow a concept from Daniel Kahneman. He talks about two systems of thinking:
- System 1: Fast, automatic thinking. Like knowing 3 + 4 = 7 without having to calculate it.
- System 2: Slower, more deliberate thinking. Like solving a complex math problem step by step.
System 1 thinking is automatic.
System 2 thinking requires taking things step-by-step.
We use System 2 thinking to break complicated problems down into simpler steps.
For example, you likely don’t have the answer to 39 + 48 memorized the same way as you know the answer to 3 + 4.
Instead, you’d need to break the more complicated question down into steps like this:
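- Split the numbers into tens and ones: 39 + 48 = (30 + 9) + (40 + 8)
- Add the tens: 30 + 40 = 70
- Add the ones: 9 + 8 = 17
- Combine: 70 + 17 = 87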
Current LLMs work a lot like System 1 thinking. They give you an answer fast, but as complexity increases, accuracy can suffer.
o1, on the other hand, is more like System 2 thinking. It breaks down complex problems into smaller, manageable steps. This approach can help with one of LLMs' biggest weaknesses: hallucinations.
For example, if you ask o1 how many Rs are in "strawberry", it might first guess incorrectly. But then it'll go through the word letter by letter, count the Rs, and give you the right answer.
This step-by-step approach isn't entirely novel. Chain-of-thought techniques have been around for a while.
In fact, open-source models like Alibaba's QwQ are already beginning to use this approach with similar performance. What's new is training a model specifically to reason this way.
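To see the difference in practice, here's a minimal sketch using the OpenAI Node SDK. The model name and prompt wording are my own illustrative choices - the second prompt manually asks an ordinary model to reason step by step, which is the same thing o1 is trained to do on its own:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set

// System 1-style prompt: ask for the answer in one shot.
const direct = await client.chat.completions.create({
  model: "gpt-4o", // illustrative; any chat model works here
  messages: [{ role: "user", content: 'How many Rs are in "strawberry"?' }],
});

// System 2-style (chain-of-thought) prompt: make the model spell the
// word out letter by letter before committing to a count.
const stepByStep = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content:
        'Spell "strawberry" one letter at a time, mark each R as you go, ' +
        "then count the marks and state the total.",
    },
  ],
});

console.log(direct.choices[0].message.content);
console.log(stepByStep.choices[0].message.content);
```

The tradeoff is visible right in the output: the step-by-step version is more reliable, but it burns far more tokens to get there.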
As speeds increase and costs decrease, we might be able to afford this extra time to get better answers without hurting the user experience.
The problem is, o1 doesn't give a better answer for every type of problem. But it is always slower and more expensive.
Currently, o1 Preview costs four times more per token than Claude Sonnet. It also outputs 2-10 times more tokens because of all the "thinking" it does.
This means an o1 output could cost up to 40 times more than Sonnet (4x the per-token price times up to 10x the tokens) - and it's not always better.
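To make that math concrete, here's the back-of-the-envelope calculation (the multipliers are the rough figures above, not exact published pricing):

```typescript
// Rough cost multiplier for o1 Preview vs Claude Sonnet,
// using the approximate figures above.
const pricePerTokenMultiplier = 4; // o1 Preview is ~4x Sonnet per token
const tokensEmittedMultiplier = { low: 2, high: 10 }; // o1 "thinks" out loud

const totalCostMultiplier = {
  low: pricePerTokenMultiplier * tokensEmittedMultiplier.low, // 8x
  high: pricePerTokenMultiplier * tokensEmittedMultiplier.high, // 40x
};

console.log(totalCostMultiplier); // { low: 8, high: 40 }
```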
The slowness is a big issue too. You often don't see any results from o1 for 10, 20, 30 seconds or more. That's a much worse user experience.
Conversion above done with Builder.io using claude-3-5-sonnet-20241022 and o1-preview-2024-09-12 (latest API-accessible model versions). I hand-checked the results on ChatGPT with the latest o1 and didn't see any major differences in speed or accuracy.
The product in the video above costs $20 per month now. If the o1 model costs up to 40 times more, it doesn’t seem worth it to pay $800 a month for an offering that is slower and only marginally better, if at all.
So why is this still interesting? AI agents.
We want to use AI to complete a series of tasks without constant human supervision.
For that to work, AI needs to be better at breaking things down and completing tasks step by step. It also needs to have a lower failure rate and catch its own mistakes sooner.
Traditional LLMs are great at predicting the next word in a sentence, but they weren't trained to break down and execute tasks.
For instance, Claude's computer use today has only a 15% success rate at accomplishing real-world tasks. o1 is showing us what happens when we do train for that.
Source: Anthropic
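To picture what an agent loop actually needs, here's a minimal sketch in TypeScript of the plan, execute, and verify cycle described above. The `complete` helper, model name, and prompts are illustrative assumptions, not any particular product's implementation:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set

// Hypothetical helper: get a single text completion for a prompt.
async function complete(prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o", // illustrative model choice
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
}

// A minimal plan -> act -> verify loop. Real agents add tools, memory,
// and guardrails; this only shows the shape of the idea.
async function runAgent(goal: string, maxAttempts = 3): Promise<string> {
  // 1. Break the goal into steps (System 2-style decomposition).
  const plan = await complete(
    `Break this goal into a short numbered list of steps:\n${goal}`
  );

  let result = "";
  for (const step of plan.split("\n").filter((line) => line.trim())) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      // 2. Execute one step at a time.
      result = await complete(`Goal: ${goal}\nComplete this step: ${step}`);

      // 3. Check the work before moving on, so mistakes are caught
      //    early instead of compounding across steps.
      const verdict = await complete(
        `Does this output correctly complete the step? Answer PASS or FAIL.\n` +
          `Step: ${step}\nOutput: ${result}`
      );
      if (verdict.includes("PASS")) break;
    }
  }
  return result;
}
```

The verify step is the part current models struggle with most - an agent that can't reliably catch its own mistakes compounds errors across every step.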
How much better can this get with novel training methods? Will this lead to new breakthroughs, or are we heading for an AI bubble burst?
The average person's life hasn't changed much despite all the AI hype. If AI models aren't getting that much better, and these new techniques don't lead to major breakthroughs, some of the hype might start to cool off.
And you know what? That might not be the worst thing.
As the dot-com bubble peaked in 2000, people invested in the web as if its major challenges had already been solved. There was widespread excitement about the web's potential for explosive growth and profitability.
However, that excitement deflated when companies burned through hundreds of millions in venture capital and failed to become profitable. Only a few companies survived, and they now serve as models for how online businesses can succeed.
Today, we're seeing a similar pattern with AI - growing excitement and venture capital flowing freely. The enthusiasm mirrors what we saw during the dot-com bubble's growth.
But there will be hard problems. Progress will come in S-curves. While AI could solve an immense number of problems, it might not happen today. Some AI applications might not be effective enough for mass adoption for another decade.
Remember Webvan and Pets.com? They were viable internet businesses - just 10 years too early.
Before you hit me with the "but this time it's different, man" – don't forget that that's what people say every time.
While the future is uncertain, some AI use cases are already working well:
- AI chat assistants for brainstorming, researching, writing, and editing
- AI-assisted coding, whether through specialized IDEs, Copilot, full application builders, or tools that convert designs into high-quality code
These categories are seeing major adoption with generally happy users.
Maybe AI agents will change everything as soon as next year. Or maybe it'll take longer. We just don't know yet.
What we do know is that some AI applications are already proving their worth. Those are the ones I'm watching closely, and I'm excited to see what comes next.