How transformer-based networks are improving self-driving software
The architecture behind LLMs is helping autonomous vehicles drive more smoothly.
It’s Autonomy Week! This is the third of five articles exploring the state of the self-driving industry.
As I climbed into the self-driving Prius for a demo ride, my host handed me a tablet. In the upper left-hand corner, it said “Ask Nuro Driver.” At the bottom was a big button that said “What are you doing? Why?”
I pushed the button and the tablet responded: “I am stopped because I am yielding to a pedestrian who might cross my path.”
A few seconds later, I pushed it again: “I am accelerating because my path is clear.”
I was having a conversation with a self-driving car. Sort of.
Nuro used to be one of the hottest startups in the self-driving industry, raising $940 million in 2019, $500 million in 2020, and another $600 million in 2021. The company had big plans to deploy thousands of street-legal delivery robots.
But then Nuro had a crisis of confidence. It laid off almost half of its workers between November 2022 and May 2023. In a May 2023 blog post, Nuro’s founders announced they would delay Nuro’s next generation of delivery robots to conserve cash while they focused on research and development.
“Recent advancements in AI have increased our confidence and ability to reach true generalized and scaled autonomy faster,” the founders wrote. “Our focus now will be on making our autonomy stack even more data driven.”
One result of that shift: the tablet I held in my hands during last month’s demo ride.
It might seem silly to chat with a self-driving car, but it would be a mistake to dismiss this as just a gimmick. Borrowing techniques from large language models has made Nuro’s robots smoother, more confident drivers, according to Nuro chief operating officer Andrew Chapin.
And Nuro isn’t alone. In April, a British startup called Wayve announced LINGO-2, “the first language model to drive on public roads.” A video showed a vehicle driving through busy London streets while explaining its actions with English phrases like “reducing speed for the cyclist.” Investors were impressed enough to put $1 billion into Wayve in May.
Wayve portrays its more established competitors as dinosaurs wedded to obsolete technology. But the dinosaurs aren’t standing still. Some have started using the transformer, the architecture underlying LLMs, in their own autonomy stacks.
“We really leveraged that technology of transformers for behavioral prediction, for decision-making, for semantic understanding,” said Dmitri Dolgov, co-CEO of Google’s Waymo, in a February interview. Dolgov argued that Waymo’s self-driving software is “very nicely complementary with the world knowledge and the common sense that we get” from transformer-based language models, adding that “recently we’ve been doing work to combine the two.”
Indeed, since 2022, Waymo has published at least eight papers detailing its use of transformer-based networks for various aspects of the self-driving problem. These new networks have helped to make the Waymo Driver smoother and more confident on the road.
As impressive as these early results are, it’s not obvious that things will progress all the way to Wayve’s vision of a single LLM-like network driving our cars. More likely, self-driving systems in the future will use a mix of transformer-based networks and more traditional techniques. It’ll take a lot of trial and error to figure out the best combination of techniques to provide passengers with safe and comfortable rides.
Transformers: more than meets the eye
If you read my explainer on large language models last year, you know that LLMs are trained to predict the next token in a sequence of text. Although these models were initially text-only, researchers soon found that variants of the transformer architecture could be applied to a wide range of other domains, including images, audio files, and even sequences of amino acids.
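To make that objective concrete, here’s a toy sketch in Python. The bigram table and the tokens in it are invented for illustration; a real transformer learns its scoring function from enormous text corpora, but the basic autoregressive loop (predict one token, append it, repeat) is the same.

```python
# A toy illustration of next-token prediction, the training objective behind LLMs.
# The "model" here is just a hand-written bigram table standing in for a real
# transformer; the point is the autoregressive loop, not the scores themselves.

# Hypothetical scores: given the previous token, how plausible is each next token?
BIGRAM_SCORES = {
    "the": {"car": 0.5, "pedestrian": 0.3, "light": 0.2},
    "car": {"is": 0.6, "stops": 0.4},
    "is": {"yielding": 0.7, "accelerating": 0.3},
    "yielding": {"to": 1.0},
    "to": {"the": 1.0},
}

def predict_next(prev_token: str) -> str:
    """Pick the highest-scoring next token given the previous one (greedy decoding)."""
    candidates = BIGRAM_SCORES.get(prev_token, {"<end>": 1.0})
    return max(candidates, key=candidates.get)

def generate(prompt: list[str], max_new_tokens: int = 6) -> list[str]:
    """Repeatedly append the predicted next token: the autoregressive loop."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens[-1])
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate(["the"])))
# -> "the car is yielding to the car"
```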
They also found that transformer-based models can be multimodal. OpenAI’s GPT-4o, for example, can accept a mix of text, images, and audio. Under the hood, GPT-4o represents each image or snippet of audio as a sequence of tokens, which are then thrown into the same stream as the text tokens.
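OpenAI hasn’t published exactly how GPT-4o tokenizes images, so treat the following as a schematic of the general recipe rather than a description of any particular model: cut the image into patches, map each patch to a discrete image-token id, and concatenate those ids with the text tokens. The patch size, codebook, and token ids below are all assumptions made for illustration.

```python
# A schematic of how a multimodal transformer can fold an image into the same
# token stream as text: patchify the image, quantize each patch to a discrete
# "image token" id, and concatenate with the text tokens.
import numpy as np

PATCH = 16           # patch size in pixels (assumed)
CODEBOOK_SIZE = 512  # number of distinct image tokens (assumed)
rng = np.random.default_rng(0)

# A made-up codebook: each image token id corresponds to one prototype patch.
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH * PATCH * 3))

def image_to_tokens(image: np.ndarray) -> list[int]:
    """Quantize each 16x16 patch to the id of its nearest codebook entry."""
    h, w, _ = image.shape
    tokens = []
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            distances = np.linalg.norm(codebook - patch, axis=1)
            tokens.append(int(distances.argmin()))
    return tokens

# Pretend text tokens for a prompt like "what is in this picture?"
text_tokens = [1012, 883, 40, 1771, 3344, 30]

camera_frame = rng.normal(size=(64, 64, 3))  # stand-in for a real image
combined = text_tokens + image_to_tokens(camera_frame)
print(len(combined), "tokens feed into one transformer")
```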
LLMs can also be trained to output new kinds of tokens. And researchers at Google found this ability was useful for robotics. Last year, Google DeepMind announced a vision-language-action model called RT-2. RT-2 takes in images from a robot’s cameras and textual commands from a user and outputs “action tokens” that give low-level instructions to the robot.
To accomplish this, researchers started with a conventional multimodal LLM trained on text and images harvested from the web. This model was fine-tuned using a database of robot actions. Each training example would include a high-level command (like “pick up the banana”) as well as a sequence of low-level robot actions (like “rotate second joint by 7 units, then rotate first joint by 9 units, then close gripper by 5 units”). Google created the database by having a human being execute each command using a video game controller hooked up to a robot arm.
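Here’s a hedged sketch of what one such training example might look like. The serialization format, field names, and bin count are my own illustrative assumptions, loosely modeled on the description above rather than on Google’s actual data pipeline.

```python
# A sketch of a vision-language-action training example: a high-level text
# command paired with a low-level robot action, serialized so the action can
# be predicted token by token. The format below is assumed for illustration.

NUM_BINS = 256  # assumed: each continuous action value is discretized into bins

def discretize(value: float, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous joint/gripper command into one of NUM_BINS integer bins."""
    clipped = min(max(value, low), high)
    return round((clipped - low) / (high - low) * (NUM_BINS - 1))

def make_training_example(command: str, action: dict[str, float]) -> str:
    """Serialize a (text command, low-level action) pair as one string.

    During fine-tuning, the model learns to continue the prompt with the
    integer "action tokens"; at run time, those integers are decoded back
    into motor commands for the robot.
    """
    action_tokens = " ".join(str(discretize(v)) for v in action.values())
    return f"instruction: {command} | action: {action_tokens}"

example = make_training_example(
    "pick up the banana",
    {"joint_1": 0.09, "joint_2": 0.07, "gripper": -0.05},  # hypothetical values
)
print(example)
# -> "instruction: pick up the banana | action: 139 136 121"
```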
The resulting model was not only capable of carrying out new tasks, it showed a remarkable ability to generalize across domains. For example, researchers placed several flags on a table along with a banana, then told the robot: “move banana to Germany.” RT-2 recognized the German flag and placed the banana on top of it.
The robotics training set used for fine-tuning probably didn’t have any examples involving German flags. However, the underlying model was trained on millions of images harvested from the web—some of which undoubtedly did have German flags. The model retained its understanding of flags as it was being fine-tuned to output robot commands, resulting in a model that understood both domains.
This line of research pointed to tantalizing possibilities for the self-driving industry. For example, self-driving software needs to recognize objects like fire trucks or stop signs. Traditionally, self-driving companies would build training sets manually, collecting thousands of examples of fire trucks and stop signs from driving footage, then labeling them by hand.
But what if that wasn’t necessary? What if it were possible to get comparable—or even better—performance by tweaking a pre-trained image model that wasn’t originally designed for self-driving? Not only would that save a lot of human labor, it might lead to models that can recognize a much wider range of objects.
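To give a rough sense of what “tweaking a pre-trained image model” means in practice, here is a minimal transfer-learning sketch in PyTorch: start from a network pre-trained on general imagery, freeze it, and train only a small new classification head for driving-specific categories. The class list and data loader are hypothetical placeholders, not anything Waymo or Nuro has described.

```python
# A minimal transfer-learning sketch: reuse a model pre-trained on general
# images and retrain only a new head for driving-specific categories.
import torch
from torch import nn
from torchvision import models

DRIVING_CLASSES = ["fire truck", "stop sign", "pedestrian", "cyclist", "other"]

# Start from weights learned on general imagery, not driving footage.
model = models.resnet18(weights="DEFAULT")

# Freeze the pre-trained backbone so only the new head gets updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for our categories.
model.fc = nn.Linear(model.fc.in_features, len(DRIVING_CLASSES))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# `driving_loader` would be a DataLoader over labeled driving images;
# it's assumed here rather than defined.
# for images, labels in driving_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```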
And what about those action tokens RT-2 generated to control robots? Could a similar technique enable transformer-based foundation models to control self-driving vehicles directly?
Nobody is better positioned to answer these questions than Vincent Vanhoucke, who led the Google robotics team that invented RT-2. In August, Vanhoucke announced he was going to Waymo to explore how to use foundation models to develop “safer and smarter autonomous vehicles.”
Vanhoucke has only been at Waymo for a few weeks. But other Waymo researchers have been experimenting with transformers for several years—and in some cases, publishing their results in scientific papers.