Overview of the CALVIN Project
CALVIN (Composing Actions from Language and Vision) is an open-source benchmark designed to facilitate learning of long-horizon, language-conditioned robotic manipulation tasks. Developed by Oier Mees, Lukas Hermann, Erick Rosete, and Wolfram Burgard, CALVIN aims to advance robotic manipulation by letting natural-language instructions drive robot behavior, enabling robots to perform complex tasks over extended horizons.
Key Features
Open-Source and Award-Winning
CALVIN is open-source, making it accessible to researchers and developers worldwide, and it won the 2022 IEEE Robotics and Automation Letters (RA-L) Best Paper Award.
Language-Conditioned Policy Learning
The project focuses on agents that learn to execute robotic tasks from natural-language instructions, giving users a more intuitive way to specify what a robot should do.
Complexity and Flexibility
CALVIN features long task sequences that chain multiple manipulation subtasks and pairs them with diverse language instructions. The benchmark also lets users choose which sensor modalities an agent observes, supporting a wide range of experimental configurations.
Getting Started
Installation
To start using CALVIN, clone the repository (including its submodules) and follow the installation steps. The project targets Python 3.8, and its dependencies can be installed with the provided scripts. The setup instructions also document workarounds for common dependency issues.
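As a quick post-installation sanity check, a minimal sketch is shown below; the package names `calvin_agent` and `calvin_env` are assumptions based on the repository layout and should be adjusted to match the installed checkout.

```python
import importlib
import sys

# CALVIN targets Python 3.8; print a warning if the interpreter differs.
if sys.version_info[:2] != (3, 8):
    print(f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, "
          "but CALVIN is developed against 3.8")

# Package names are assumptions based on the repository layout
# (calvin_models/calvin_agent and the calvin_env submodule).
for pkg in ("calvin_agent", "calvin_env"):
    importlib.import_module(pkg)
    print(f"Imported {pkg} successfully")
```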
Dataset and Training
CALVIN offers several dataset splits depending on the user's requirements, such as data collected in a single environment, in a subset of environments, or across all environments, plus a small debug split. For training, the project includes baseline agents and scripts for training custom models. Datasets can be loaded into shared memory to speed up training, and the training scripts scale to multi-GPU setups when needed.
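To get a feel for the raw data, a hedged sketch for inspecting a single stored timestep is shown below; the directory name, episode filename, and stored keys are assumptions based on the raw CALVIN dataset layout (one `.npz` file per timestep) and should be checked against the downloaded split.

```python
import numpy as np

# Placeholder path: the raw CALVIN dataset stores each timestep as a separate
# .npz file; substitute an episode file that actually exists in your split.
episode = np.load("dataset/task_D_D/training/episode_0000000.npz", allow_pickle=True)

# Print the stored modalities (camera images, depth maps, robot state, actions, ...).
for key in episode.files:
    value = episode[key]
    print(f"{key}: shape={getattr(value, 'shape', None)}, dtype={getattr(value, 'dtype', None)}")
```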
Sensory Observations and Action Spaces
CALVIN supports several sensor inputs and action-space configurations; a short sketch of the corresponding data layout follows the list below:
- Sensory Inputs: RGB images and depth maps from a static camera and a gripper-mounted camera, tactile images, and proprioceptive state readings.
- Action Spaces: The benchmark supports action spaces such as absolute Cartesian poses, relative Cartesian displacements, and joint actions, providing comprehensive options for robotic control.
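A minimal sketch of these modalities is given below; the array shapes (e.g., 200x200 static RGB, 84x84 gripper RGB, 15-dimensional proprioceptive state) and the 7-dimensional relative-action layout are assumptions drawn from the CALVIN data description and should be verified against the downloaded dataset.

```python
import numpy as np

# Illustrative observation layout; shapes are assumptions and should be
# verified against the actual CALVIN data.
observation = {
    "rgb_static": np.zeros((200, 200, 3), dtype=np.uint8),   # static camera RGB
    "rgb_gripper": np.zeros((84, 84, 3), dtype=np.uint8),    # gripper camera RGB
    "depth_static": np.zeros((200, 200), dtype=np.float32),  # static camera depth map
    "robot_obs": np.zeros(15, dtype=np.float32),             # proprioceptive state
}

# Relative Cartesian action: xyz displacement, Euler-angle rotation delta,
# and a gripper command (1 = open, -1 = close).
rel_action = np.array([0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 1.0], dtype=np.float32)
```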
Evaluation: The CALVIN Challenge
Long-Horizon Multi-Task Language Control
The benchmark evaluates an agent's ability to complete long-horizon tasks specified purely through language, such as opening drawers or picking and placing objects. Trained agents are evaluated with commands that load a model checkpoint and roll out sequences of instructions, measuring how many subtasks the agent completes in a row.
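For intuition on how such rollouts are typically aggregated, the sketch below computes success rates per chain length and the average number of completed subtasks from hypothetical per-sequence results (the values are made up for illustration):

```python
import numpy as np

# Hypothetical results: number of consecutive subtasks (out of 5) completed
# before the first failure in each evaluation rollout.
completed = np.array([5, 3, 0, 2, 5, 1, 4, 0, 5, 2])

# Success rate for chains of length k: fraction of rollouts in which the agent
# completed at least k subtasks in a row.
for k in range(1, 6):
    print(f"{k} consecutive tasks solved: {np.mean(completed >= k):.2f}")

# Average number of consecutive subtasks completed per rollout.
print(f"Average sequence length: {completed.mean():.2f}")
```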
Custom Model and Language Evaluation
Users can plug their own models and language embeddings into CALVIN for evaluation. Scripts are provided to run custom models against the benchmark's evaluation protocol and assess their performance.
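A minimal sketch of such a custom-model wrapper is shown below, assuming the evaluation loop calls `reset()` at the start of each rollout and `step(obs, goal)` at every timestep; the exact interface should be taken from the repository's evaluation code.

```python
import numpy as np


class RandomPolicy:
    """Placeholder custom model; the reset()/step(obs, goal) pattern is an
    assumption and should be checked against CALVIN's evaluation script."""

    def reset(self):
        # Called at the start of each rollout; clear recurrent state or
        # observation history here.
        pass

    def step(self, obs, goal):
        # `obs` is the current sensory observation, `goal` the language
        # instruction (or its embedding); return a 7-dim relative action.
        return np.random.uniform(-1.0, 1.0, size=7)
```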
State-of-the-Art (SOTA) Models and Reinforcement Learning
Results from state-of-the-art (SOTA) models are tracked on a public leaderboard. Additionally, the project provides resources for experimenting with reinforcement learning (RL) in the CALVIN environment.
Frequently Asked Questions
EGL Rendering and Multi-GPU Issues
CALVIN uses EGL for GPU-accelerated off-screen rendering to speed up simulation. The FAQ documents workarounds for compatibility issues, particularly for multi-GPU setups in cluster environments.
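On multi-GPU machines it can help to pin rendering and training to the same device before the environment is created; the environment-variable name below is illustrative only, and the actual variable honored by the EGL setup should be taken from the project's FAQ.

```python
import os

# Illustrative only: "EGL_VISIBLE_DEVICES" is a stand-in for whatever variable
# the EGL setup actually reads; consult the CALVIN FAQ for the real name.
os.environ.setdefault("EGL_VISIBLE_DEVICES", os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
print("Rendering device hint:", os.environ["EGL_VISIBLE_DEVICES"])
```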
Teleoperation and Custom Data Recording
Although recording new demonstrations is not extensively documented, the underlying simulation environment includes support for collecting data via VR teleoperation, giving researchers flexibility to record their own task demonstrations.
CALVIN is released under the open-source MIT license and continues to evolve, inviting further innovation in natural-language-conditioned robot control.