DexVLA by juruobenruo

Visuomotor policy learning with plug-in diffusion expert

created 5 months ago
428 stars

Top 70.3% on sourcepulse

Project Summary

DexVLA is a vision-language-action framework for visuomotor policy learning in robotics. It couples a powerful Vision-Language Model (VLM) backbone with a plug-in diffusion expert to enable general robot control. The project targets researchers and engineers working on robotic manipulation and imitation learning.

How It Works

DexVLA leverages the Qwen2-VL-2B model as its VLM backbone, providing robust vision-language understanding without further VLM fine-tuning. For policy learning, it utilizes a diffusion-based expert, specifically ScaleDP, which can be scaled to 1 billion parameters. This modular design allows for flexibility in choosing the diffusion policy head and supports a staged training approach for optimal performance.
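
The two-part design described above can be sketched in miniature: a frozen backbone produces a conditioning embedding, and a diffusion-style expert iteratively refines a chunk of actions conditioned on it. All function names, dimensions, and the toy update rule below are illustrative assumptions, not the repository's actual API; a real diffusion policy predicts noise with a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_encode(image, instruction, dim=64):
    """Stand-in for the Qwen2-VL-2B backbone: fuse image + text into one vector.
    (Purely illustrative; the real backbone is a 2B-parameter transformer.)"""
    enc = np.random.default_rng(len(instruction)).standard_normal(dim)
    return enc + image.mean()

def diffusion_expert_denoise(cond, horizon=8, action_dim=7, steps=10):
    """Stand-in for the ScaleDP expert: start from noise and iteratively
    refine an action chunk conditioned on the VLM embedding."""
    actions = rng.standard_normal((horizon, action_dim))
    target = np.tanh(cond[:action_dim])  # toy condition-dependent target
    for _ in range(steps):
        # A real diffusion policy runs a learned denoising network here.
        actions += 0.3 * (target - actions)
    return actions

image = rng.standard_normal((224, 224, 3))
cond = vlm_encode(image, "pick up the red block")
action_chunk = diffusion_expert_denoise(cond)
print(action_chunk.shape)  # (8, 7): a horizon of 8 seven-DoF actions
```

The modularity is the point: because the expert only consumes a conditioning vector, the diffusion head can be swapped or scaled (up to the 1B-parameter ScaleDP-H) without retraining the VLM.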

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n dexvla python=3.10), activate it, install requirements (pip install -r requirements.txt), and install the policy_heads package (cd policy_heads && pip install -e .). For acceleration, install flash-attn.
  • Data: Data must be in h5py format, compatible with the act project. A conversion script from rlds to h5py is provided.
  • Pretrained Weights: Download official Qwen2-VL weights and replace config.json with the provided one. Download ScaleDP-H weights (Stage 1).
  • Training: The scripts scripts/stage2_train.sh and scripts/stage3_train.sh are provided; each requires specifying output directories, task names, and paths to pretrained weights.
  • Evaluation: Requires checkpoints with preprocessor_config.json and chat_template.json.
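
Since episodes must be stored in h5py format compatible with the act project, a minimal episode file might look like the sketch below. The dataset names (/observations/qpos, /observations/images/<camera>, /action) are assumptions drawn from the act project's conventions, not verified against this repository; shapes are toy values.

```python
import os
import tempfile

import h5py
import numpy as np

T, action_dim = 20, 7  # toy episode length and action dimension
path = os.path.join(tempfile.mkdtemp(), "episode_0.hdf5")

# Write one episode in an act-style layout (names are assumptions).
with h5py.File(path, "w") as f:
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((T, action_dim), dtype=np.float32))
    images = obs.create_group("images")
    images.create_dataset("top", data=np.zeros((T, 64, 64, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((T, action_dim), dtype=np.float32))

# Read it back the way a dataloader would.
with h5py.File(path, "r") as f:
    action_shape = f["action"].shape
    qpos_shape = f["observations/qpos"].shape
print(action_shape, qpos_shape)  # (20, 7) (20, 7)
```

The provided rlds-to-h5py conversion script presumably produces files of this general shape; check its output against your act-compatible dataloader before training.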

Highlighted Details

  • Integrates Qwen2-VL-2B for vision-language understanding.
  • Uses ScaleDP, a diffusion policy model, as the plug-in expert.
  • Supports staged training (Stage 2 and Stage 3).
  • Offers memory-saving techniques: DeepSpeed offload, LoRA fine-tuning for the VLM, and smaller ScaleDP models.
  • Based on LLaVA and act-plus-plus projects.
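
Of the memory-saving options above, LoRA is the most technique-specific: rather than updating the full weight matrix, it trains a low-rank additive update. The numpy sketch below shows the general mechanism only; DexVLA's actual LoRA fine-tuning of the VLM is configured through its training scripts, and all shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 8  # rank r is much smaller than d_out, d_in

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0
                                           # so W + BA == W at the start

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)  # LoRA forward pass: base output + low-rank update

# Trainable parameter count drops from d_out*d_in to r*(d_out + d_in).
full_params, lora_params = d_out * d_in, r * (d_out + d_in)
print(full_params, lora_params)  # 65536 4096
```

This is why LoRA pairs well with DeepSpeed offload for fitting the 2B-parameter VLM on limited GPU memory: only the small A and B matrices need optimizer state.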

Maintenance & Community

The project is actively updated with recent news regarding training scripts and model releases (e.g., ScaleDP-H). Links to papers for DexVLA, Diffusion-VLA, and ScaleDP are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, its dependencies on LLaVA and act-plus-plus suggest potential licensing considerations for commercial use.

Limitations & Caveats

  • A TypeError related to _batch_encode_plus with an images argument is noted; the workaround involves copying specific JSON files.
  • CUDA OOM issues are common; provided solutions include DeepSpeed offload, LoRA, and smaller ScaleDP models.
  • A bug causing random actions during evaluation has been fixed in a recent commit.
  • Precision overflow in DDIMScheduler can lead to NaN action values during inference, requiring a code modification.
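
The DDIMScheduler overflow caveat is a generic low-precision failure mode: intermediate arithmetic in float16 can overflow to inf and then propagate NaN, while the same computation stays finite in float32. The sketch below demonstrates the mechanism only; it is not the repository's actual patch, which modifies the scheduler code directly.

```python
import numpy as np

# float16 has a maximum representable value of ~65504, so a scheduler
# intermediate near that range overflows to inf, and inf - inf yields NaN,
# which then contaminates every downstream action value.
x16 = np.array([60000.0], dtype=np.float16)
with np.errstate(over="ignore", invalid="ignore"):
    overflowed = x16 * np.float16(2.0)  # 120000 exceeds the float16 range
    diff = overflowed - overflowed      # inf - inf -> nan

print(bool(np.isinf(overflowed[0])))  # True
print(bool(np.isnan(diff[0])))        # True

# The same computation in float32 is well within range and stays finite,
# which is why upcasting the affected scheduler math avoids the NaNs.
x32 = np.float32(60000.0) * np.float32(2.0)
print(bool(np.isfinite(x32)))  # True
```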

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 131 stars in the last 90 days
