cliport by cliport

Robotic manipulation via imitation learning using language-conditioned policies

created 3 years ago
504 stars

Top 62.6% on sourcepulse

Project Summary

CLIPort is an end-to-end imitation learning agent for robotic manipulation tasks, designed to learn a single, generalizable policy from limited demonstrations. It combines CLIP's semantic understanding with TransporterNets' spatial precision, enabling robots to perform tasks based on natural language instructions. The target audience includes robotics researchers and engineers seeking to develop more adaptable and language-aware robotic systems.

How It Works

CLIPort integrates CLIP's visual-language understanding with TransporterNets' spatial reasoning. It uses CLIP to interpret natural language commands (the "what") and TransporterNets to predict precise end-effector movements (the "where"). This dual-pathway approach allows for learning generalizable skills from a small number of demonstrations, bridging the gap between high-level semantic goals and low-level robotic actions.
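
As a rough illustration of this two-stream design (a minimal sketch, not the actual cliport code; all module and variable names here are hypothetical), a spatial stream preserves scene geometry while a semantic stream is gated by a language embedding, and the fused features produce a dense per-pixel action heatmap:

    # Illustrative sketch only -- not the actual cliport implementation.
    # A spatial stream ("where") and a language-gated semantic stream ("what")
    # are fused into a dense per-pixel action heatmap, Transporter-style.
    import torch
    import torch.nn as nn

    class TwoStreamPolicy(nn.Module):
        def __init__(self, lang_dim=512, feat_dim=64):
            super().__init__()
            # Spatial stream: fully convolutional, preserves scene geometry.
            self.spatial = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            )
            # Semantic stream: image features modulated by a language embedding
            # (a stand-in for CLIP's text/visual encoders).
            self.semantic = nn.Conv2d(3, feat_dim, 3, padding=1)
            self.lang_gate = nn.Linear(lang_dim, feat_dim)
            self.head = nn.Conv2d(2 * feat_dim, 1, 1)  # dense heatmap head

        def forward(self, rgb, lang_emb):
            spat = self.spatial(rgb)                           # B x F x H x W
            gate = torch.sigmoid(self.lang_gate(lang_emb))     # B x F
            sem = self.semantic(rgb) * gate[:, :, None, None]  # language gating
            logits = self.head(torch.cat([spat, sem], dim=1))  # B x 1 x H x W
            return logits.flatten(1).softmax(dim=1)            # pixel distribution

    policy = TwoStreamPolicy()
    rgb = torch.randn(1, 3, 160, 160)   # top-down RGB observation
    lang = torch.randn(1, 512)          # e.g., a CLIP text embedding
    print(policy(rgb, lang).shape)      # torch.Size([1, 25600])

The softmax over pixels mirrors the Transporter-style formulation: the action (e.g., a pick location) is read off as the argmax of a heatmap, which is what lets the policy stay fully convolutional and spatially precise.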

Quick Start & Requirements

  • Installation: Clone the repository, create a virtual environment (Python 3.8 recommended), install the requirements (pip install -r requirements.txt), and install the package in development mode (python setup.py develop).
  • Prerequisites: an NVIDIA GPU with 8.5–9.5 GB of memory; PyTorch v1.7.1 and torchvision v0.8.2 with CUDA support.
  • Quickstart: Download the pre-trained models (sh scripts/quickstart_download.sh), generate test data (python cliport/demos.py), and evaluate (python cliport/eval.py). The full command sequence is collected below.
  • Resources: The full dataset requires ~1.6 TB of storage. Pre-trained checkpoints are available via Google Drive.
  • Docs: cliport.github.io
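
For convenience, the steps above collected in order. The repository URL is assumed from the project name, and demos.py / eval.py may take additional task and config arguments documented on the project site; the invocations below follow the bullets verbatim.

    # Quickstart, collected from the steps above.
    # Repo URL assumed from the project name; demos.py / eval.py may take
    # additional task/config arguments -- see cliport.github.io.
    git clone https://github.com/cliport/cliport.git
    cd cliport

    python3.8 -m venv .venv          # Python 3.8 recommended
    source .venv/bin/activate
    pip install -r requirements.txt
    python setup.py develop          # install in development mode

    sh scripts/quickstart_download.sh   # fetch pre-trained models
    python cliport/demos.py             # generate test data
    python cliport/eval.py              # evaluate the downloaded checkpoints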

Highlighted Details

  • Learns a single policy for multiple tabletop manipulation tasks.
  • Leverages CLIP for language understanding and TransporterNets for spatial prediction.
  • Supports both single-task and multi-task training from scratch.
  • Includes scripts for dataset generation, training, evaluation, and video recording.

Maintenance & Community

The project was presented at CoRL 2021. Issues can be filed via the GitHub issue tracker.

Licensing & Compatibility

The project incorporates code from Google Ravens (Apache 2.0), OpenAI CLIP (MIT), and Pytorch-UNet (GPL 3.0). The GPL 3.0 license on the UNet component carries copyleft obligations that can restrict linking it into closed-source or commercial products.

Limitations & Caveats

The author describes the code as "tired grad student" quality, and training currently supports only a batch size of 1 due to memory constraints. Rotation augmentation may cause issues for tasks that depend on precise spatial relationships. Multi-task models are not trained on the full seen-object split for certain tasks, which can disadvantage them relative to single-task models. The README suggests averaging evaluation metrics over multiple runs with different seeds; this is not automated (a minimal helper is sketched below).
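
The seed-averaging step takes only a few lines to script. A minimal sketch, where run_eval is a hypothetical stand-in for a function that performs one evaluation run and returns a score:

    # Hypothetical helper (not part of cliport): average an evaluation
    # metric over several seeds, since the README leaves this step manual.
    import statistics

    def average_over_seeds(run_eval, seeds=(0, 1, 2)):
        """run_eval(seed) -> float score for one evaluation run."""
        scores = [run_eval(seed) for seed in seeds]
        return statistics.mean(scores), statistics.stdev(scores)

    # Usage with a stand-in evaluation function:
    mean, std = average_over_seeds(lambda seed: 0.90 + 0.01 * seed)
    print(f"success rate: {mean:.3f} +/- {std:.3f}")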

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 18 stars in the last 90 days
