Universal RoboAgent exhibiting its skills across diverse tasks in unseen scenarios
RoboAgent can efficiently acquire a wide diversity of non-trivial skills and can generalize them to diverse unseen scenarios.
Trained on merely 7,500 trajectories, we demonstrate a universal RoboAgent that exhibits a diverse set of 12 non-trivial manipulation skills (beyond picking/pushing, including articulated-object manipulation and object re-orientation) across 38 tasks, and generalizes them to hundreds of diverse unseen scenarios (involving unseen objects, unseen tasks, and completely unseen kitchens). RoboAgent can also evolve its capabilities with new experiences.
Building a robotic agent that can generalize to many different scenarios requires a dataset with broad coverage. While we recognize that scaling efforts generally help (e.g. RT-1 presents results with ~130,000 robot trajectories), our goal is to understand the principles of efficiency and generalization in a learning system under a data budget. Low-data regimes often result in over-fitting. Our main aim is thus to develop a powerful paradigm that can learn a generalizable universal policy while avoiding overfitting in this low-data regime.
The dataset RoboSet(MT-ACT) used for training RoboAgent consists of merely 7,500 trajectories (18x less data than RT-1). The dataset was collected ahead of time and kept frozen. It consists of high-quality (mostly successful) trajectories collected via human teleoperation on commodity robotic hardware (Franka-Emika arms with Robotiq grippers) across multiple tasks and scenes. RoboSet(MT-ACT) sparsely covers 12 unique skills in a few different contexts. It was collected by dividing everyday kitchen activities (e.g. making tea, baking) into sub-tasks, each representing a unique skill. The dataset includes common pick-place skills, but also contact-rich skills such as wiping and capping, as well as skills involving articulated objects.
In addition to the RoboSet(MT-ACT) data used for training RoboAgent, we are also releasing RoboSet, a much larger dataset collected over the course of a few related projects, containing a total of 100,050 trajectories and including non-kitchen scenes. We are open-sourcing the entire RoboSet to facilitate and accelerate open-source research in robot learning.
The figure on the right compares our proposed MT-ACT policy representation against several imitation learning architectures. For this result we use environment variations that include only object pose changes and some lighting changes. Following previous works, we refer to this as L1 generalization. The results clearly show that using action chunking to model sub-trajectories significantly outperforms all baselines, reinforcing the effectiveness of our proposed policy representation for sample-efficient learning.
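To make the action-chunking idea concrete, here is a minimal numpy sketch (not our actual MT-ACT implementation; the chunk size `H`, the `dummy_policy` stand-in, and the decay weight `m` are placeholder choices): the policy predicts a short sub-trajectory of `H` future actions at every step, and overlapping predictions for the same timestep are blended with exponentially decayed weights before execution, as in ACT-style temporal ensembling.

```python
import numpy as np

H = 4  # chunk size: number of future actions predicted per step (placeholder value)

def dummy_policy(obs, t):
    # Stand-in for the learned chunking policy: given the current observation,
    # it predicts H future actions at once. Here it just emits the timestep.
    return np.full(H, float(t))

def rollout(T, m=0.1):
    """Execute with temporal ensembling: at step t, average every action that
    any earlier chunk predicted for t, down-weighting older predictions by
    exp(-m * age)."""
    preds = {}    # timestep -> list of (source_step, predicted_action)
    actions = []  # one executed action per timestep
    for t in range(T):
        chunk = dummy_policy(None, t)
        for k, a in enumerate(chunk):
            preds.setdefault(t + k, []).append((t, float(a)))
        cand = preds[t]
        w = np.array([np.exp(-m * (t - src)) for src, _ in cand])
        a_t = float(np.dot(w, [a for _, a in cand]) / w.sum())
        actions.append(a_t)
    return actions
```

Because each step commits only to a single blended action while still modeling multi-step sub-trajectories, this representation smooths execution and reduces compounding errors relative to one-action-at-a-time policies.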
The figure above shows the different levels of generalization we test our approach on: L1 with object pose changes, L2 with diverse table backgrounds and distractors, and L3 with novel skill-object combinations. Next, we show how each method performs at these levels of generalization. In a rigorous evaluation study, we observe that MT-ACT significantly outperforms all other methods, especially at the harder generalization levels (L3).
Next, we evaluate how RoboAgent performs with increasing levels of semantic augmentation. We evaluate this on one activity (5 skills). The figure below shows that with increased data (i.e. more augmentations per frame), performance significantly improves across all generalization levels. Importantly, the gain is largest for the hardest tasks (L3 generalization).
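The mechanics of this data multiplication can be sketched as follows (a toy illustration, not the real pipeline: in practice semantic augmentation segments the scene and inpaints new objects/backgrounds with a generative model, whereas `semantically_augment` below just tags frames; `expand_dataset` and all names are hypothetical). The key property is that augmentation changes appearance only, so every augmented copy reuses the original teleoperated action labels.

```python
import random

def semantically_augment(obs, rng):
    # Stand-in for the real augmentation step (segmentation + generative
    # inpainting of novel objects/backgrounds); here we only tag the frame.
    return {**obs, "background": rng.choice(["kitchen_a", "kitchen_b", "office"])}

def expand_dataset(trajectories, n_aug, seed=0):
    """Multiply a fixed teleop dataset n_aug-fold: each augmented trajectory
    keeps the original actions, since only the visuals change."""
    rng = random.Random(seed)
    out = list(trajectories)  # keep the originals
    for traj in trajectories:
        for _ in range(n_aug):
            out.append([{"obs": semantically_augment(step["obs"], rng),
                         "action": step["action"]}  # action labels are reused
                        for step in traj])
    return out
```

With this structure, raising the number of augmentations per frame grows the visual diversity seen during training without any additional robot time, which is consistent with the trend in the figure above.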
RoboAgent is a convergence of multiple research threads (GenAug, CACTI, ACT) spanning two years, and at the same time a starting point for many future research directions. During the course of its development we have been heavily inspired by, and learned a lot from, many recent generalizable robot learning projects. For further reading in this space, refer to: RT-1 (and the much more recent RT-2), which study generalization in robot learning at scale with large demonstration datasets in association with large language models. Differently, RoboCat uses an iterative learning and data-generation pipeline for fast adaptation. Recent works have also shown the advantages of using more efficient policy representations for multi-modal data, via either action chunking or diffusion models (e.g. Reuss et al., Chi et al.). Finally, recent works such as ROSIE, GenAug, and CACTI have also used open-world object-detection based methods for semantic augmentation, and other related works (e.g. R3M, H2R, VRB) have investigated different ways of combining largely passive learning with some active fine-tuning.
@misc{bharadhwaj2023roboagent,
title={RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking},
author={Homanga Bharadhwaj and Jay Vakil and Mohit Sharma and Abhinav Gupta and Shubham Tulsiani and Vikash Kumar},
year={2023},
eprint={2309.01918},
archivePrefix={arXiv},
primaryClass={cs.RO}
}