Google’s RT-2 AI model brings us one step closer to WALL-E

Google’s RT-2 AI model brings us one step closer to WALL-E

"First-of-its-kind" robot AI model can recognize trash and perform complex actions.

Benj Edwards – Jul 28, 2023 9:32 PM | 52

A Google robot controlled by RT-2. Credit: Google

More than meets the eye

With RT-2, Google DeepMind has adopted a strategy that plays on the strengths of transformer AI models, known for their capacity to generalize information. RT-2 draws on earlier AI work at Google, including the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E). Additionally, RT-2 was also co-trained on data from its predecessor model (RT-1), which was collected over a period of 17 months in an "office kitchen environment" by 13 robots.

The RT-2 architecture involves fine-tuning a pre-trained VLM model on robotics and web data. The resulting model processes robot camera images and predicts actions that the robot should execute.

Google fine tuned a VLM model on robotics and web data. The resulting model takes in robot camera images and predicts actions for a robot to perform. — Google fine-tuned a VLM model on robotics and web data. The resulting model takes in robot camera images and predicts actions for a robot to perform. Credit: Google

Since RT-2 uses a language model to process information, Google chose to represent actions as tokens, which are traditionally fragments of a word. "To control a robot, it must be trained to output actions," Google writes. "We address this challenge by representing actions as tokens in the model’s output—similar to language tokens—and describe actions as strings that can be processed by standard natural language tokenizers."

In developing RT-2, researchers used the same method of breaking down robot actions into smaller parts as they did with the first version of the robot, RT-1. They found out that by turning these actions into a series of symbols or codes (a "string" representation), they could teach the robot new skills using the same learning models they use for processing web data.

The model also utilizes chain-of-thought reasoning, enabling it to perform multi-stage reasoning like choosing an alternative tool (a rock as an improvised hammer) or picking the best drink for a tired person (an energy drink).

According to Google, chain-of-thought reasoning enables a robot control model that perform complex actions when instructed. — According to Google, chain-of-thought reasoning enables a robot control model to perform complex actions when instructed. Credit: Google

Google says that in over 6,000 trials, RT-2 was found to perform as well as its predecessor, RT-1, on tasks that it was trained for, referred to as "seen" tasks. However, when tested with new, "unseen" scenarios, RT-2 almost doubled its performance to 62 percent compared to RT-1's 32 percent.

Although RT-2 shows a great ability to adapt what it has learned to new situations, Google recognizes that it's not perfect. In the "Limitations" section of the RT-2 technical paper, the researchers admit that while including web data in the training material "boosts generalization over semantic and visual concepts," it does not magically give the robot new abilities to perform physical motions that it hasn't already learned from its predecessor's robot-training data. In other words, it can't perform actions it hasn't physically practiced before, but it gets better at using the actions it already knows in new ways.

While the ultimate goal for Google DeepMind is to create general-purpose robots, the company knows that there is still plenty of research work ahead before it gets there. But technology like RT-2 seems like a strong step in that direction.

Google’s RT-2 AI model brings us one step closer to WALL-E

Ars Video

More than meets the eye

nproxy.org