Diffusion Transformer for bimanual robot manipulation
RDT-1B is a large-scale Diffusion Transformer foundation model for bimanual robotic manipulation, designed for researchers and engineers working on advanced robotics. It enables robots to perform complex tasks based on language instructions and visual input, offering state-of-the-art performance in dexterity and generalization.
How It Works
RDT-1B utilizes a Diffusion Transformer architecture, pre-trained on over 1 million multi-robot episodes. This approach allows it to predict a sequence of 64 robot actions from language instructions and multi-view RGB images. The model's design is inherently flexible, supporting various robot configurations (single/dual-arm, joint/EEF control, position/velocity) and even wheeled locomotion.
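To make the prediction step concrete, the sketch below shows, in schematic form rather than the actual RDT-1B API, how a diffusion policy of this kind turns Gaussian noise into a 64-step action chunk by iterative denoising. The denoiser callable, the conditioning bundle, and ACTION_DIM are hypothetical stand-ins for the Diffusion Transformer, its language/vision embeddings, and the model's unified action space.

    import torch

    HORIZON, ACTION_DIM, T = 64, 14, 100   # ACTION_DIM is a placeholder
    betas = torch.linspace(1e-4, 0.02, T)  # standard DDPM noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    @torch.no_grad()
    def predict_action_chunk(denoiser, cond):
        # denoiser(x, t, cond) -> predicted noise; `cond` bundles the
        # language and image embeddings (hypothetical interface).
        x = torch.randn(1, HORIZON, ACTION_DIM)  # start from pure noise
        for t in reversed(range(T)):
            eps = denoiser(x, torch.tensor([t]), cond)
            a, ab = alphas[t], alpha_bars[t]
            # DDPM posterior mean, then add noise on all but the last step.
            x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x  # (1, 64, ACTION_DIM): the next 64 actions to execute

In chunked action prediction of this style, the robot typically executes some or all of the 64 predicted actions before re-querying the model, amortizing the cost of the denoising loop.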
Quick Start & Requirements
Create a conda environment (conda create -n rdt python=3.10.0), activate it, and install PyTorch matching your CUDA version (e.g., pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121), followed by packaging==24.0, flash-attn, and the remaining requirements (pip install -r requirements.txt). Download the off-the-shelf encoders (T5-v1.1-XXL, SigLIP) and link them into the repository.
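As a sketch of the encoder step, assuming the standard Hugging Face releases of these encoders (the exact checkpoints and local paths follow the repository's README; the model ids below are assumptions), loading them with transformers looks roughly like:

    from transformers import (AutoTokenizer, T5EncoderModel,
                              SiglipImageProcessor, SiglipVisionModel)

    # Model ids are assumptions; substitute the checkpoints the README
    # links to, and point the repo's config at the download paths.
    t5_tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
    t5_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    image_processor = SiglipImageProcessor.from_pretrained(
        "google/siglip-so400m-patch14-384")
    vision_encoder = SiglipVisionModel.from_pretrained(
        "google/siglip-so400m-patch14-384")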
Fine-tuning requires a dataset buffer of at least 400GB.
Highlighted Details
Maintenance & Community
The project is from thu-ml, with notable contributors listed in the paper. Community channels are not explicitly mentioned in the README.
Licensing & Compatibility
Limitations & Caveats
The T5-XXL language encoder requires significant VRAM; users with limited GPU memory should pre-compute the text embeddings offline or substitute a smaller encoder. Fine-tuning requires careful dataset preparation and a custom dataset loader implementation. The README also notes that the mapping of EEF rotations to the 6D rotation representation is not reversible.
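A minimal sketch of the pre-computation workaround, assuming the Hugging Face T5-v1.1-XXL checkpoint and a hypothetical cache layout: embed each instruction once offline and save the tensor, so the multi-billion-parameter encoder never occupies GPU memory alongside the policy at train or inference time.

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl", torch_dtype=torch.bfloat16)
    encoder.eval()

    @torch.no_grad()
    def cache_instruction(text: str, path: str) -> None:
        # Encode once and persist; downstream code loads the cached tensor
        # instead of running the T5 encoder on the fly.
        inputs = tokenizer(text, return_tensors="pt")
        embeds = encoder(**inputs).last_hidden_state  # (1, seq_len, 4096)
        torch.save(embeds.cpu(), path)

    # Hypothetical instruction and output path, for illustration only.
    cache_instruction("fold the towel with both arms",
                      "lang_embeds/fold_towel.pt")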