RDT2 by thu-ml

Foundation model for zero-shot robotic manipulation across embodiments

Created 3 weeks ago


512 stars

Top 61.1% on SourcePulse

View on GitHub
Project Summary

RDT2 enables zero-shot cross-embodiment generalization for robotic manipulation: robots can execute language instructions on embodiments they were never trained on, without retraining. The project targets robotics researchers and engineers, offering a versatile foundation model for diverse robotic platforms that reduces the need for extensive task-specific fine-tuning.

How It Works

The project ships two primary models: RDT2-VQ, an auto-regressive Vision-Language-Action (VLA) model adapted from Qwen2.5-VL-7B-Instruct that tokenizes actions with Residual VQ, and RDT2-FM, an optimized RDT model that serves as a low-latency action expert trained with a flow-matching objective. Both are trained on over 10,000 hours of human manipulation data collected across more than 100 indoor scenes, which underpins their robust generalization across embodiments.
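The two ideas behind these models, residual vector quantization of action chunks and flow-matching denoising of actions, can be illustrated with a short sketch. The code below is an illustrative approximation under assumed dimensions, codebook counts, and a toy velocity network; it is not the project's implementation.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Toy residual vector quantizer: each stage quantizes the residual
    left by the previous stage, turning a continuous action chunk into a
    short sequence of discrete codebook indices a VLA can emit as tokens."""
    def __init__(self, dim=64, codebook_size=256, num_stages=4):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_stages)]
        )

    def encode(self, x):
        residual, indices = x, []
        for cb in self.codebooks:
            d = torch.cdist(residual, cb)      # distances to every codeword
            idx = d.argmin(dim=-1)             # nearest codeword per sample
            residual = residual - cb[idx]      # pass the residual to the next stage
            indices.append(idx)
        return torch.stack(indices, dim=-1)     # (batch, num_stages) discrete tokens

    def decode(self, indices):
        # sum the selected codewords from every stage to reconstruct the action
        return sum(cb[indices[:, i]] for i, cb in enumerate(self.codebooks))


def flow_matching_sample(velocity_fn, obs, action_dim=64, steps=10):
    """Euler integration of a learned velocity field from noise (t=0) to an
    action sample (t=1): the usual flow-matching inference loop."""
    a = torch.randn(obs.shape[0], action_dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((obs.shape[0], 1), k * dt)
        a = a + velocity_fn(a, t, obs) * dt
    return a


if __name__ == "__main__":
    rvq = ResidualVQ()
    actions = torch.randn(2, 64)               # stand-in for an action chunk
    tokens = rvq.encode(actions)                # discrete action tokens
    recon = rvq.decode(tokens)                  # back to continuous actions
    print(tokens.shape, recon.shape)

    # toy velocity network standing in for the low-latency action expert
    obs = torch.randn(2, 128)
    net = nn.Sequential(nn.Linear(64 + 1 + 128, 256), nn.ReLU(), nn.Linear(256, 64))
    vel = lambda a, t, o: net(torch.cat([a, t, o], dim=-1))
    sampled = flow_matching_sample(vel, obs)
    print(sampled.shape)
```

The split mirrors the description above: the auto-regressive model predicts discrete RVQ tokens, while the flow-matching expert produces continuous actions in a handful of integration steps for low-latency control.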

Quick Start & Requirements

Installation involves cloning the repository, creating a Python 3.10 conda environment, and installing PyTorch (CUDA 12.8), flash-attn, and the remaining dependencies from requirements.txt; specific robot integrations require additional packages. Inference needs an NVIDIA GPU with at least 16GB of VRAM (RTX 4090 recommended), while fine-tuning needs 32GB+ (A100/H100 for RDT2-VQ LoRA or full fine-tuning). The system is tested on Ubuntu 24.04. Critical setup also includes acquiring the designated end effectors and cameras and performing detailed calibration.
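Before installing, it may help to confirm that the local GPU meets the thresholds above. The snippet below is a minimal convenience check using PyTorch; the 16 GB and 32 GB figures are taken from this section, and the snippet is not part of the RDT2 repository.

```python
import torch

# Compare local VRAM against the stated requirements:
# 16 GB for inference, 32 GB+ for fine-tuning.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device found; an NVIDIA GPU is required.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM, CUDA {torch.version.cuda}")
print("inference  :", "OK" if vram_gb >= 16 else "below the 16 GB recommendation")
print("fine-tuning:", "OK" if vram_gb >= 32 else "below the 32 GB recommendation")
```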

Health Check

Last Commit: 4 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 7

Star History

515 stars in the last 26 days
