RoboticsDiffusionTransformer by thu-ml

Diffusion Transformer for bimanual robot manipulation

created 10 months ago
1,342 stars

Top 30.5% on sourcepulse

Project Summary

RDT-1B is a large-scale Diffusion Transformer foundation model for bimanual robotic manipulation, designed for researchers and engineers working on advanced robotics. It enables robots to perform complex tasks based on language instructions and visual input, offering state-of-the-art performance in dexterity and generalization.

How It Works

RDT-1B utilizes a Diffusion Transformer architecture, pre-trained on over 1 million multi-robot episodes. This approach allows it to predict a sequence of 64 robot actions from language instructions and multi-view RGB images. The model's design is inherently flexible, supporting various robot configurations (single/dual-arm, joint/EEF control, position/velocity) and even wheeled locomotion.
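To make the prediction loop concrete, here is a minimal sketch of diffusion-style action-chunk sampling. Every class and variable name below is a hypothetical placeholder, not the repository's actual API; the real model conditions on T5 language embeddings and SigLIP image features and uses a proper noise schedule rather than the simplified update shown here.

```python
import torch

# Hypothetical stand-in for the RDT noise-prediction policy. The real network is a
# 1B-parameter Diffusion Transformer conditioned on language and multi-view images.
class DummyActionDenoiser(torch.nn.Module):
    def __init__(self, action_dim=14, horizon=64, cond_dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(action_dim + cond_dim + 1, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, action_dim),
        )
        self.horizon = horizon

    def forward(self, noisy_actions, cond, t):
        # noisy_actions: (B, horizon, action_dim); cond: (B, cond_dim); t: (B,)
        B, H, _ = noisy_actions.shape
        cond = cond[:, None, :].expand(B, H, -1)
        t = t[:, None, None].expand(B, H, 1).float()
        return self.net(torch.cat([noisy_actions, cond, t], dim=-1))

@torch.no_grad()
def sample_action_chunk(model, cond, steps=20, action_dim=14, horizon=64):
    """Iteratively denoise Gaussian noise into a 64-step action chunk."""
    x = torch.randn(cond.shape[0], horizon, action_dim)
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, cond, t_batch)
        x = x - eps / steps  # simplified update; real samplers follow a noise schedule
    return x

model = DummyActionDenoiser()
cond = torch.randn(1, 512)          # stand-in for fused language + vision features
actions = sample_action_chunk(model, cond)
print(actions.shape)                # torch.Size([1, 64, 14])
```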

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda create -n rdt python=3.10.0), activate it, install PyTorch (matching CUDA version, e.g., pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121), packaging==24.0, flash-attn, and other requirements (pip install -r requirements.txt).
  • Prerequisites: CUDA 12.1 (recommended), Python 3.10, PyTorch 2.1.0, flash-attn. Requires downloading off-the-shelf encoders (T5-v1.1-XXL, SigLIP) and linking them. Fine-tuning requires a dataset buffer of at least 400GB.
  • Resources: The T5-XXL encoder is VRAM-intensive; pre-computing language embeddings is recommended for GPUs with less than 24 GB of VRAM (see the sketch after this list).
  • Links: Paper, Project Page, Model, Data.
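
The repository ships its own script for pre-computing language embeddings; the snippet below is only a rough sketch of the general approach using Hugging Face transformers, so file names and the example instruction are illustrative.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load the T5-v1.1-XXL encoder. It alone needs tens of GB of memory, which is why
# instruction embeddings are computed once offline instead of at every training step.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.bfloat16
).eval()

instruction = "fold the towel with both arms"  # example instruction (illustrative)
tokens = tokenizer(instruction, return_tensors="pt", padding=True)

with torch.no_grad():
    # last_hidden_state: (1, seq_len, 4096) embedding of the instruction
    embedding = encoder(**tokens).last_hidden_state

torch.save(embedding, "towel_instruction_emb.pt")  # reuse during training/inference
```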

Highlighted Details

  • 1 billion parameters, the largest diffusion model for robotics to date.
  • Pre-trained on 1M+ multi-robot episodes, enabling broad generalization.
  • Achieves state-of-the-art performance on bimanual manipulation tasks.
  • Supports fine-tuning on custom datasets and deployment on real robots.

Maintenance & Community

The project is from thu-ml, with notable contributors listed in the paper. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and closed-source linking.

Limitations & Caveats

The T5-XXL language encoder requires significant VRAM; users with limited GPU memory must pre-compute embeddings or use smaller models. Fine-tuning requires careful dataset preparation and implementation of custom dataset loaders. The README notes that EEF rotation mapping to 6D representation is not reversible.
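
For the dataset-loader requirement, the sketch below shows the general shape of an episode loader that serves fixed-length 64-step action chunks. The HDF5 layout, class name, and field names are hypothetical; the repository documents its own expected dataset interface, which should be followed for actual fine-tuning.

```python
import h5py
import numpy as np
from torch.utils.data import Dataset

class EpisodeChunkDataset(Dataset):
    """Hypothetical loader: one HDF5 file per episode, containing an 'actions'
    dataset and an 'observations/rgb' dataset. Only the idea of serving
    fixed-length action chunks is meant to carry over to a real implementation."""

    def __init__(self, episode_paths, chunk_size=64):
        self.episode_paths = episode_paths
        self.chunk_size = chunk_size

    def __len__(self):
        return len(self.episode_paths)

    def __getitem__(self, idx):
        with h5py.File(self.episode_paths[idx], "r") as f:
            actions = f["actions"][:]          # (T, action_dim)
            images = f["observations/rgb"][:]  # (T, H, W, 3)
        # Sample a random start index and pad the chunk if the episode is shorter.
        start = np.random.randint(0, max(1, len(actions) - self.chunk_size))
        chunk = actions[start:start + self.chunk_size]
        if len(chunk) < self.chunk_size:
            pad = np.repeat(chunk[-1:], self.chunk_size - len(chunk), axis=0)
            chunk = np.concatenate([chunk, pad], axis=0)
        return {"image": images[start], "action_chunk": chunk.astype(np.float32)}
```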

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 202 stars in the last 90 days
