RoboticsDiffusionTransformer by thu-ml

Diffusion Transformer for bimanual robot manipulation

created 10 months ago
1,342 stars

Top 30.5% on sourcepulse

Project Summary

RDT-1B is a large-scale Diffusion Transformer foundation model for bimanual robotic manipulation, designed for researchers and engineers working on advanced robotics. It enables robots to perform complex tasks based on language instructions and visual input, offering state-of-the-art performance in dexterity and generalization.

How It Works

RDT-1B utilizes a Diffusion Transformer architecture, pre-trained on over 1 million multi-robot episodes. This approach allows it to predict a sequence of 64 robot actions from language instructions and multi-view RGB images. The model's design is inherently flexible, supporting various robot configurations (single/dual-arm, joint/EEF control, position/velocity) and even wheeled locomotion.
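To make the prediction loop concrete, here is a minimal sketch of diffusion-style action-chunk sampling. Every class and variable name below is a hypothetical placeholder, not the repository's actual API; the real model conditions on T5 language embeddings and SigLIP image features and uses a proper noise schedule rather than the simplified update shown here.

```python
import torch

# Hypothetical stand-in for the RDT noise-prediction policy. The real network is a
# 1B-parameter Diffusion Transformer conditioned on language and multi-view images.
class DummyActionDenoiser(torch.nn.Module):
    def __init__(self, action_dim=14, horizon=64, cond_dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(action_dim + cond_dim + 1, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, action_dim),
        )
        self.horizon = horizon

    def forward(self, noisy_actions, cond, t):
        # noisy_actions: (B, horizon, action_dim); cond: (B, cond_dim); t: (B,)
        B, H, _ = noisy_actions.shape
        cond = cond[:, None, :].expand(B, H, -1)
        t = t[:, None, None].expand(B, H, 1).float()
        return self.net(torch.cat([noisy_actions, cond, t], dim=-1))

@torch.no_grad()
def sample_action_chunk(model, cond, steps=20, action_dim=14, horizon=64):
    """Iteratively denoise Gaussian noise into a 64-step action chunk."""
    x = torch.randn(cond.shape[0], horizon, action_dim)
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, cond, t_batch)
        x = x - eps / steps  # simplified update; real samplers follow a noise schedule
    return x

model = DummyActionDenoiser()
cond = torch.randn(1, 512)          # stand-in for fused language + vision features
actions = sample_action_chunk(model, cond)
print(actions.shape)                # torch.Size([1, 64, 14])
```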

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda create -n rdt python=3.10.0), activate it, install PyTorch (matching CUDA version, e.g., pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121), packaging==24.0, flash-attn, and other requirements (pip install -r requirements.txt).
  • Prerequisites: CUDA 12.1 (recommended), Python 3.10, PyTorch 2.1.0, flash-attn. Requires downloading off-the-shelf encoders (T5-v1.1-XXL, SigLIP) and linking them. Fine-tuning requires a dataset buffer of at least 400GB.
  • Resources: The T5-XXL encoder is VRAM-intensive; pre-computing language embeddings is recommended for GPUs with less than 24 GB of VRAM (see the sketch after this list).
  • Links: Paper, Project Page, Model, Data.
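
The repository ships its own script for pre-computing language embeddings; the snippet below is only a rough sketch of the general approach using Hugging Face transformers, so file names and the example instruction are illustrative.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load the T5-v1.1-XXL encoder. It alone needs tens of GB of memory, which is why
# instruction embeddings are computed once offline instead of at every training step.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.bfloat16
).eval()

instruction = "fold the towel with both arms"  # example instruction (illustrative)
tokens = tokenizer(instruction, return_tensors="pt", padding=True)

with torch.no_grad():
    # last_hidden_state: (1, seq_len, 4096) embedding of the instruction
    embedding = encoder(**tokens).last_hidden_state

torch.save(embedding, "towel_instruction_emb.pt")  # reuse during training/inference
```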

Highlighted Details

  • 1 billion parameters, the largest diffusion model for robotics to date.
  • Pre-trained on 1M+ multi-robot episodes, enabling broad generalization.
  • Achieves state-of-the-art performance on bimanual manipulation tasks.
  • Supports fine-tuning on custom datasets and deployment on real robots.

Maintenance & Community

The project is from thu-ml, with notable contributors listed in the paper. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and closed-source linking.

Limitations & Caveats

The T5-XXL language encoder requires significant VRAM; users with limited GPU memory must pre-compute embeddings or use smaller models. Fine-tuning requires careful dataset preparation and implementation of custom dataset loaders. The README notes that EEF rotation mapping to 6D representation is not reversible.
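
For the dataset-loader requirement, the sketch below shows the general shape of an episode loader that serves fixed-length 64-step action chunks. The HDF5 layout, class name, and field names are hypothetical; the repository documents its own expected dataset interface, which should be followed for actual fine-tuning.

```python
import h5py
import numpy as np
from torch.utils.data import Dataset

class EpisodeChunkDataset(Dataset):
    """Hypothetical loader: one HDF5 file per episode, containing an 'actions'
    dataset and an 'observations/rgb' dataset. Only the idea of serving
    fixed-length action chunks is meant to carry over to a real implementation."""

    def __init__(self, episode_paths, chunk_size=64):
        self.episode_paths = episode_paths
        self.chunk_size = chunk_size

    def __len__(self):
        return len(self.episode_paths)

    def __getitem__(self, idx):
        with h5py.File(self.episode_paths[idx], "r") as f:
            actions = f["actions"][:]          # (T, action_dim)
            images = f["observations/rgb"][:]  # (T, H, W, 3)
        # Sample a random start index and pad the chunk if the episode is shorter.
        start = np.random.randint(0, max(1, len(actions) - self.chunk_size))
        chunk = actions[start:start + self.chunk_size]
        if len(chunk) < self.chunk_size:
            pad = np.repeat(chunk[-1:], self.chunk_size - len(chunk), axis=0)
            chunk = np.concatenate([chunk, pad], axis=0)
        return {"image": images[start], "action_chunk": chunk.astype(np.float32)}
```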

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 202 stars in the last 90 days
