Like Human, Like Robot:
Exploring Model Architecture and Algorithm Design Decisions in Robotic Imitation


Abstract (abridged)

In a series of robotic manipulation experiments, we observed distinct performance variations based on model architectures, observation methods, and temporal strategies. The Model Architecture Experiment on the Franka PyBullet simulation indicated that shallower image encoders are more reliable for language-conditioned grasping than deeper variants, with the latter often misgrasping. However, deeper image encoder backbones benefited from additional MLP policy head layers, offsetting performance declines from pre-trained weights. The Observation Spaces Experiment on the Robomimic MuJoCo simulation demonstrated superior performance with wrist camera images, either alone or combined with 3rd-person views, whereas exclusive reliance on the latter proved ineffective. The Temporal Correlation Experiments using a real Franka highlighted the strength of simpler algorithms in basic tasks, but emphasized the unique potential of future action predictions in complex tasks with limited visibility. Cumulatively, our findings underscore the efficacy of shallow encoders, the advantages of wrist cameras, and the promise, as well as the variability, of temporal strategies in robotic tasks.



Method Overview

Our primary objective is to enable continuous-control robotic policies to perform manipulation tasks proficiently. We train these policies with Behavior Cloning (BC) on offline demonstration datasets. The models accept a range of inputs: image observations from either a wrist-mounted camera or a 3rd-person camera, the robot's proprioceptive state (such as the end-effector pose), and language annotations that describe the task. Each model outputs a 6-DoF (degrees of freedom) delta position/rotation together with a 1-DoF gripper action.
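
To make this concrete, below is a minimal PyTorch sketch of a behavior-cloned policy with these inputs and outputs. The layer sizes, the toy convolutional encoder, and the plain MSE regression objective are illustrative assumptions, not a description of our exact models.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Illustrative BC policy: image + proprioception + language -> 7-D action."""

    def __init__(self, img_feat_dim=512, proprio_dim=7, lang_dim=768, hidden=256):
        super().__init__()
        # Stand-in image encoder; any backbone (shallow CNN, ResNet-18,
        # EfficientNet-B4) can be slotted in here.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_feat_dim),
        )
        self.policy_head = nn.Sequential(
            nn.Linear(img_feat_dim + proprio_dim + lang_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7),  # 6-DoF delta position/rotation + 1-DoF gripper
        )

    def forward(self, image, proprio, lang_emb):
        feat = torch.cat([self.image_encoder(image), proprio, lang_emb], dim=-1)
        return self.policy_head(feat)

def bc_step(policy, optimizer, batch):
    """One behavior-cloning update: regress predicted actions onto expert actions."""
    pred = policy(batch["image"], batch["proprio"], batch["lang_emb"])
    loss = nn.functional.mse_loss(pred, batch["action"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```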

The Model Architecture Experiment, conducted in the Franka PyBullet simulation, centers on language-conditioned target cube grasping with single-step BC. It compares shallow and deep image encoder backbones, each paired with a fixed-size Multilayer Perceptron (MLP) policy head. All models use a late-fusion design in which the image and language embeddings are concatenated before being passed to the policy head.
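
As a rough sketch of the late-fusion step, the snippet below embeds an instruction with a Hugging Face DistilBERT model and concatenates it with an image feature vector. The checkpoint name, the mean pooling over tokens, and the placeholder image embedding are assumptions for illustration.

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
lang_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

def encode_instruction(text):
    """Embed a task instruction (e.g. 'pick up the red cube') into a 768-D vector."""
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lang_encoder(**tokens).last_hidden_state  # (1, num_tokens, 768)
    return hidden.mean(dim=1)                              # (1, 768)

# Placeholder image feature from the vision backbone (assumed 512-D here).
image_embedding = torch.randn(1, 512)

# Late fusion: concatenate the two embeddings before the MLP policy head.
lang_emb = encode_instruction("pick up the red cube")
fused = torch.cat([image_embedding, lang_emb], dim=-1)     # (1, 512 + 768)
```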

The Observation Spaces Experiment, conducted in the Robomimic MuJoCo simulation, covers two tasks: Lift, a grasping task, and Can, a pick-and-place task. Both are trained with single-step BC. The central question of this experiment is how the choice of observation space affects performance.
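
The sketch below shows one way the three camera configurations could be wired up: each selected view gets its own small encoder and the per-view features are concatenated. The camera key names and encoder sizes are illustrative assumptions rather than the exact Robomimic configuration.

```python
import torch
import torch.nn as nn

# Hypothetical camera keys for the three observation spaces we compare.
OBS_CONFIGS = {
    "wrist_only":      ["robot0_eye_in_hand_image"],
    "third_person":    ["agentview_image"],
    "wrist_and_third": ["robot0_eye_in_hand_image", "agentview_image"],
}

class MultiViewEncoder(nn.Module):
    """Encode each selected camera stream separately and concatenate the features."""

    def __init__(self, camera_keys, feat_dim=64):
        super().__init__()
        self.camera_keys = camera_keys
        self.encoders = nn.ModuleDict({
            key: nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim),
            )
            for key in camera_keys
        })

    def forward(self, obs):
        feats = [self.encoders[k](obs[k]) for k in self.camera_keys]
        return torch.cat(feats, dim=-1)

# Example: the wrist + 3rd-person configuration.
encoder = MultiViewEncoder(OBS_CONFIGS["wrist_and_third"])
obs = {k: torch.randn(1, 3, 84, 84) for k in OBS_CONFIGS["wrist_and_third"]}
features = encoder(obs)   # shape (1, 128)
```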

Lastly, our Temporal Correlation Experiments, run on a real-world Franka, involve tasks such as Can pick-and-place and Box reorienting and stacking. The goal is to weigh the benefits of encoding past observations against predicting upcoming action sequences, a comparison that becomes especially interesting when the only input is wrist camera images.
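
The two temporal strategies can be sketched as follows: a recurrent policy that summarizes a window of past observation features (BC-RNN-style), and a policy head that emits a chunk of future actions from the current observation (the core idea behind ACT, which in full uses a transformer with a CVAE objective). Both classes below are simplified illustrations with assumed dimensions.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """Encode a window of past observation features with an LSTM (BC-RNN-style)."""

    def __init__(self, obs_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 7)               # 6-DoF delta pose + gripper

    def forward(self, obs_seq):                        # (B, T, obs_dim)
        out, _ = self.rnn(obs_seq)
        return self.head(out[:, -1])                   # action for the current step

class ChunkPolicy(nn.Module):
    """Predict a chunk of future actions from the current observation (ACT-style idea)."""

    def __init__(self, obs_dim=64, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.head = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * 7),
        )

    def forward(self, obs):                            # (B, obs_dim)
        return self.head(obs).view(-1, self.horizon, 7)

# Shapes only: one policy consumes a 5-step history, the other predicts 10 future steps.
obs_seq = torch.randn(1, 5, 64)
print(HistoryPolicy()(obs_seq).shape)       # torch.Size([1, 7])
print(ChunkPolicy()(obs_seq[:, -1]).shape)  # torch.Size([1, 10, 7])
```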

We invite readers to delve deeper into the findings and insights from these experiments.


Experiment 1 - Franka (PyBullet Sim)

In this section, we show rollouts of language-conditioned target cube grasping policies. We use late fusion and compare different image encoders with a fixed-size MLP policy head, and we further explore the deeper image encoder models with a +9-layer MLP head. In all models, the image embedding is concatenated with a DistilBERT language embedding.
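
For reference, the snippet below shows one way these backbones and the deeper policy head could be instantiated with torchvision. Realizing "+9 MLP policy layers" as nine additional hidden Linear+ReLU blocks, and the hidden sizes used here, are illustrative assumptions rather than our exact configuration.

```python
import torch.nn as nn
from torchvision.models import (
    resnet18, ResNet18_Weights,
    efficientnet_b4, EfficientNet_B4_Weights,
)

def build_image_encoder(name, pretrained):
    """Return (backbone, feature_dim) with the classification layer stripped."""
    if name == "resnet18":
        model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1 if pretrained else None)
        feat_dim = model.fc.in_features
        model.fc = nn.Identity()
    elif name == "efficientnet_b4":
        model = efficientnet_b4(weights=EfficientNet_B4_Weights.IMAGENET1K_V1 if pretrained else None)
        feat_dim = model.classifier[1].in_features
        model.classifier = nn.Identity()
    else:
        raise ValueError(name)
    return model, feat_dim

def build_policy_head(in_dim, hidden=256, extra_layers=0):
    """Base MLP head, optionally deepened with extra hidden layers (e.g. +9)."""
    layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
    for _ in range(extra_layers):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    layers.append(nn.Linear(hidden, 7))   # 6-DoF delta pose + gripper
    return nn.Sequential(*layers)

# Example: pre-trained EfficientNet-B4 backbone with a +9-layer policy head.
backbone, feat_dim = build_image_encoder("efficientnet_b4", pretrained=True)
head = build_policy_head(feat_dim + 768, extra_layers=9)   # +768 for the DistilBERT embedding
```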

Success rate by model

4-layer CNN: 100%
ResNet-18, trained from scratch: 100%
ResNet-18, pre-trained and fine-tuned: 78%
ResNet-18, trained from scratch (+9 MLP policy layers): 100%
ResNet-18, pre-trained and fine-tuned (+9 MLP policy layers): 100%
EfficientNet-B4, trained from scratch: 67%
EfficientNet-B4, pre-trained and fine-tuned: 67%
EfficientNet-B4, trained from scratch (+9 MLP policy layers): 100%
EfficientNet-B4, pre-trained and fine-tuned (+9 MLP policy layers): 100%

Language-conditioned target cube grasping

Shallow Model

4-Layer CNN

Below are sample rollouts of the deeper image encoders. Whether trained from scratch or fine-tuned from pre-trained weights, these deeper models reliably completed the task once the larger MLP policy head was added. All clips are sped up by 9x.

Deeper Models

ResNet-18, trained from scratch
ResNet-18, trained from scratch (+9 MLP policy layers)
ResNet-18, pre-trained and fine-tuned
ResNet-18, pre-trained and fine-tuned (+9 MLP policy layers)

EfficientNet-B4, trained from scratch
EfficientNet-B4, trained from scratch (+9 MLP policy layers)
EfficientNet-B4, pre-trained and fine-tuned
EfficientNet-B4, pre-trained and fine-tuned (+9 MLP policy layers)

Experiment 2 - Robomimic (MuJoCo Sim)

In this section, we show rollouts for the different observation spaces used in the Lift task (grasping) and the Can task (pick-and-place). We evaluate policies trained on wrist camera images, 3rd-person camera images, and combined wrist + 3rd-person camera images. All clips are sped up by 7x.

Success rate by observation space

Lifting task:
  Wrist camera images: 100%
  3rd-person camera images: 60%
  Wrist + 3rd-person camera images: 100%

Can pick-and-place:
  Wrist camera images: 84%
  3rd-person camera images: 0%
  Wrist + 3rd-person camera images: 88%

Lifting task

Wrist camera images
3rd person camera images
Wrist + 3rd-person camera images

Can pick-and-place

Wrist camera images
3rd person camera images
Wrist + 3rd-person camera images

As shown above, a policy trained only on 3rd-person camera demonstrations underperforms and is unable to complete the full task. In contrast, policies trained on wrist camera demonstrations, alone or combined with 3rd-person camera images, clearly outperform the 3rd-person-only policy.

Camera perspective

Lift 3rd-person video
Lift eye-in-hand video

Can pick-and-place 3rd-person video
Can pick-and-place eye-in-hand video

Experiment 3 - Franka (Real World)

In this section, we show rollouts for the different temporal-correlation strategies used in a can pick-and-place task and a box reorienting and stacking task. We compare a stateless policy against encoding history and against predicting a future action sequence, using only wrist camera images as input.
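
As a reference point for how these strategies differ at rollout time, the sketch below shows open-loop execution of a predicted action chunk: the policy is queried once, the whole chunk is executed, and only then is the policy queried again. The `env` and `policy` interfaces here are placeholders, not a real robot API. Single-step BC can be seen as the horizon-1 special case of this loop, while closed-loop variants re-query the policy every step.

```python
import torch

def rollout_open_loop(env, policy, episode_len=200):
    """Execute a predicted chunk of actions before re-querying the policy.

    Assumes `policy(obs)` returns a (horizon, 7) tensor of future actions and
    `env.step(action)` returns the next observation; both are placeholders.
    """
    obs = env.reset()
    steps = 0
    while steps < episode_len:
        with torch.no_grad():
            chunk = policy(obs)        # (horizon, 7) future actions
        for action in chunk:           # open-loop: no re-planning inside the chunk
            obs = env.step(action)
            steps += 1
            if steps >= episode_len:
                break
    return obs
```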

Success rate by algorithm

Lifting task:
  single-step BC: 90%
  BC-RNN: 30%
  open-loop BC-RNN: 100%
  ACT: 60%

Can pick-and-place:
  single-step BC: 0%
  BC-RNN: 0%
  open-loop BC-RNN: 0%
  ACT: 80%

Can pick-and-place task

single-step BC
BC-RNN
open-loop BC-RNN
ACT

Box reorienting and stacking task

single-step BC
BC-RNN
open-loop BC-RNN
ACT

As shown above, ACT is the strongest overall choice. Predicting future actions improves success rates in tasks with severe partial observability, but it can also lower performance, possibly due to overfitting to suboptimal, highly multimodal human demonstrations, as seen in the lift task.

Camera perspective

Can pick-and-place eye-in-hand video
Box reorienting and stacking eye-in-hand video
