RT-2: Controlling a robot using large language models (LLMs)

google deepmind large language models (llm) vision-language models (vlm) Aug 14, 2023

It’s no surprise that language models like ChatGPT have taken the world by storm. Researchers, meanwhile, keep discovering new capabilities and properties of these models. Recently, DeepMind released a model called RT-2 (Robotic Transformer 2), which can control a robotic arm to perform tasks in response to natural language commands such as “place the cube next to the scissors.” RT-2 can also perform tasks that involve objects, and even actions, that it hasn’t previously seen. It does so by producing sequences of low-level actions that a robotic arm can understand and carry out. In this article, we will explore how the researchers accomplished this.




Vision-Language Models

The research team took existing vision-language models (VLMs) as their starting point. These are language models trained to accept both images and text as input and to produce a text sequence in response. Typically, they are trained on multiple datasets containing up to billions of image-text pairs to accomplish tasks such as image captioning or visual Q&A.

The researchers used two vision-language models created by Google, PaLI-X and PaLM-E (the latter an extension of the PaLM language model). These models can perform a variety of complex visual interpretation and reasoning tasks, but their responses are in natural language rather than instructions that a robot can interpret.


Vision-Language-Action Models

To solve this problem, the researchers decided to represent the actions that the robot can understand and execute as text tokens, which is exactly the type of output that language models can produce.

The actions that the robot can carry out consist of 6 values for positional and rotational displacement of the robot’s “hand”, as well as the extension of its gripper (the part that grabs and releases objects). In addition, there is a special command to indicate that the task has been completed.

That is, for a language model to control our robot, it must respond to our requests (tasks) with a sequence the robot knows how to interpret: the termination flag, the positional and rotational displacement values, and the gripper extension.



Except for the termination command, all of these values are continuous. To let the model emit them as tokens, the range of each value is discretized into 256 bins, so every numerical value in the command becomes an integer between 0 and 255. Our language model can therefore generate a text string like the following, which the robot can interpret once it is converted back into the corresponding displacement and rotation values:


“1 128 91 241 5 101 127”
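The discretization step can be sketched in a few lines of Python. This is only an illustration, not DeepMind's implementation: the value range of -1.0 to 1.0, the position of the termination flag as the first token, and the names `to_bin`, `from_bin`, `encode_action`, and `decode_action` are all assumptions made for the example.

```python
from typing import List, Tuple

NUM_BINS = 256          # number of discrete bins described in the article
LOW, HIGH = -1.0, 1.0   # ASSUMPTION: hypothetical range for every continuous value


def to_bin(value: float) -> int:
    """Map a continuous value in [LOW, HIGH] to a bin index in 0..NUM_BINS-1."""
    clipped = min(max(value, LOW), HIGH)
    fraction = (clipped - LOW) / (HIGH - LOW)
    # The upper edge (fraction == 1.0) would land in bin 256, so clamp it to 255.
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def from_bin(index: int) -> float:
    """Map a bin index back to the center of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return LOW + (index + 0.5) * width


def encode_action(terminate: bool, continuous: List[float]) -> str:
    """Build the text string: termination flag first, then one bin index per value."""
    tokens = [int(terminate)] + [to_bin(v) for v in continuous]
    return " ".join(str(t) for t in tokens)


def decode_action(text: str) -> Tuple[bool, List[float]]:
    """Parse the text string back into a termination flag and continuous values."""
    first, *rest = (int(t) for t in text.split())
    return bool(first), [from_bin(i) for i in rest]
```

Under these assumptions, `encode_action(False, [0.0, -1.0, 1.0])` yields the string `0 128 0 255`. Decoding returns the center of each bin, so a round trip changes a value by at most half a bin width.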


Once the problem of how to translate the output of our language model into instructions that the robot can execute is solved, the most important question remains: how to train the language model to generate the appropriate commands to solve a specific task?

To do this, researchers used data collected by a robot interacting in a kitchen environment. Each sample of this dataset includes a text representation of the task to be performed, describing the action and the objects involved, images from the robot’s camera, as well as the actions carried out by the robot to solve the task.
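A sample like this maps naturally onto an image plus a (prompt, target) text pair for fine-tuning. The sketch below is hypothetical: the `RobotSample` fields, the prompt wording, and the placeholder file name are assumptions for illustration, not the actual dataset schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RobotSample:
    """One step of a recorded demonstration (hypothetical schema)."""
    instruction: str          # text description of the task
    image_path: str           # camera image seen by the robot at this step
    action_tokens: List[int]  # discretized action the robot actually took


def to_training_pair(sample: RobotSample) -> Tuple[str, str, str]:
    """Return (image_path, prompt, target) for one supervised example."""
    # ASSUMPTION: the prompt wording is illustrative only.
    prompt = f"What action should the robot take to {sample.instruction}?"
    target = " ".join(str(t) for t in sample.action_tokens)
    return sample.image_path, prompt, target


sample = RobotSample(
    instruction="place the cube next to the scissors",
    image_path="frame_0001.png",  # placeholder file name
    action_tokens=[1, 128, 91, 241, 5, 101, 127],
)
image, prompt, target = to_training_pair(sample)
```

The model then sees the image and the prompt as input and is trained to produce the target string, exactly as it would for any other text answer.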

Using this data, they fine-tuned the language model to produce the actions that solve a task, given a text description and a camera image. To ensure that the resulting model keeps its ability to generalize and to understand instructions and objects not necessarily included in the robotics dataset, they also included the data on which the vision-language models were originally trained. This reinforced the concepts and abilities the models had already acquired.

Thus, they carried out a co-fine-tuning process in which they taught the model to produce usable actions for their robot and reinforced the skills it had already learned during its original training phase.
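One simple way to picture co-fine-tuning is as batch mixing: every training batch draws some examples from the robot data and the rest from the original vision-language data. The following sketch assumes exactly that. The 50% ratio, the function name `mixed_batches`, and the toy datasets are hypothetical and are not taken from the paper.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, target) text pair


def mixed_batches(
    robot_data: Sequence[Example],
    web_data: Sequence[Example],
    batch_size: int,
    robot_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield batches that combine robot examples with original VLM examples."""
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    for _ in range(num_batches):
        # Sample with replacement from each source.
        batch = rng.choices(robot_data, k=n_robot)
        batch += rng.choices(web_data, k=batch_size - n_robot)
        rng.shuffle(batch)  # avoid a fixed robot/web ordering inside the batch
        yield batch


# Toy data, for illustration only.
robot_data = [("What action should the robot take to pick up the cup?", "0 130 120 128 128 128 128 255")]
web_data = [("What is shown in the image?", "A cup on a kitchen counter.")]

batches = list(mixed_batches(robot_data, web_data, batch_size=8,
                             robot_fraction=0.5, num_batches=3))
```

Because both kinds of example share the same text-in, text-out format, a single training loop can consume these batches without treating robot actions as a special case.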



The tests carried out by the DeepMind researchers focused mainly on the model’s ability to generalize, that is, to understand the objects, actions, and context involved in a task even if it hasn’t seen them before.

To evaluate these capabilities in RT-2, the team took 200 samples that had been held out from training on the robotics dataset. They also compared their results with those of previous models. RT-2 outperformed all its predecessors, both in tasks with known elements and in tasks where some aspect was unknown to the model.



They also tested RT-2 against the other baseline models in a simulated environment, where it again came out ahead. Notably, in this experiment RT-2 showed behaviors not included in the data from which it learned, specifically “pushing” tasks. It also recognized objects it had not previously seen in the environment.



Emergent capabilities

The researchers also measured the resulting model’s semantic understanding and basic reasoning ability in the context of the scene. RT-2 outperforms previous models in all categories (symbol understanding, reasoning, and human recognition).




This paper demonstrates that it is possible to leverage the visual reasoning abilities of Vision-Language Models (VLMs) to solve hard, domain-specific tasks. Furthermore, these abilities transfer well to new contexts and allow the resulting agents to exhibit complex and nuanced behavior even if they haven’t been explicitly trained for it.

