DexVLA by juruobenruo

Visuomotor policy learning with plug-in diffusion expert

created 5 months ago
428 stars

Top 70.3% on sourcepulse

Project Summary

DexVLA is a vision-language-action framework for visuomotor policy learning in robotics. It couples a powerful Vision-Language Model (VLM) backbone with a plug-in diffusion expert to enable general robot control. The project targets researchers and engineers working on robotic manipulation and imitation learning.

How It Works

DexVLA leverages the Qwen2-VL-2B model as its VLM backbone, providing robust vision-language understanding without further VLM fine-tuning. For policy learning, it utilizes a diffusion-based expert, specifically ScaleDP, which can be scaled to 1 billion parameters. This modular design allows for flexibility in choosing the diffusion policy head and supports a staged training approach for optimal performance.
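
The two-part design described above can be sketched in miniature: a frozen backbone produces a conditioning embedding, and a diffusion-style expert iteratively refines a chunk of actions conditioned on it. All function names, dimensions, and the toy update rule below are illustrative assumptions, not the repository's actual API; a real diffusion policy predicts noise with a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_encode(image, instruction, dim=64):
    """Stand-in for the Qwen2-VL-2B backbone: fuse image + text into one vector.
    (Purely illustrative; the real backbone is a 2B-parameter transformer.)"""
    enc = np.random.default_rng(len(instruction)).standard_normal(dim)
    return enc + image.mean()

def diffusion_expert_denoise(cond, horizon=8, action_dim=7, steps=10):
    """Stand-in for the ScaleDP expert: start from noise and iteratively
    refine an action chunk conditioned on the VLM embedding."""
    actions = rng.standard_normal((horizon, action_dim))
    target = np.tanh(cond[:action_dim])  # toy condition-dependent target
    for _ in range(steps):
        # A real diffusion policy runs a learned denoising network here.
        actions += 0.3 * (target - actions)
    return actions

image = rng.standard_normal((224, 224, 3))
cond = vlm_encode(image, "pick up the red block")
action_chunk = diffusion_expert_denoise(cond)
print(action_chunk.shape)  # (8, 7): a horizon of 8 seven-DoF actions
```

The modularity is the point: because the expert only consumes a conditioning vector, the diffusion head can be swapped or scaled (up to the 1B-parameter ScaleDP-H) without retraining the VLM.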

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n dexvla python=3.10), activate it, install requirements (pip install -r requirements.txt), and install the policy_heads package (cd policy_heads && pip install -e .). For acceleration, install flash-attn.
  • Data: Data must be in h5py format, compatible with the act project. A conversion script from rlds to h5py is provided.
  • Pretrained Weights: Download official Qwen2-VL weights and replace config.json with the provided one. Download ScaleDP-H weights (Stage 1).
  • Training: The scripts scripts/stage2_train.sh and scripts/stage3_train.sh are provided; each requires specifying output directories, task names, and paths to pretrained weights.
  • Evaluation: Requires checkpoints with preprocessor_config.json and chat_template.json.
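
Since episodes must be stored in h5py format compatible with the act project, a minimal episode file might look like the sketch below. The dataset names (/observations/qpos, /observations/images/<camera>, /action) are assumptions drawn from the act project's conventions, not verified against this repository; shapes are toy values.

```python
import os
import tempfile

import h5py
import numpy as np

T, action_dim = 20, 7  # toy episode length and action dimension
path = os.path.join(tempfile.mkdtemp(), "episode_0.hdf5")

# Write one episode in an act-style layout (names are assumptions).
with h5py.File(path, "w") as f:
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((T, action_dim), dtype=np.float32))
    images = obs.create_group("images")
    images.create_dataset("top", data=np.zeros((T, 64, 64, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((T, action_dim), dtype=np.float32))

# Read it back the way a dataloader would.
with h5py.File(path, "r") as f:
    action_shape = f["action"].shape
    qpos_shape = f["observations/qpos"].shape
print(action_shape, qpos_shape)  # (20, 7) (20, 7)
```

The provided rlds-to-h5py conversion script presumably produces files of this general shape; check its output against your act-compatible dataloader before training.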

Highlighted Details

  • Integrates Qwen2-VL-2B for vision-language understanding.
  • Uses ScaleDP, a diffusion policy model, as the plug-in expert.
  • Supports staged training (Stage 2 and Stage 3).
  • Offers memory-saving techniques: DeepSpeed offload, LoRA fine-tuning for the VLM, and smaller ScaleDP models.
  • Based on LLaVA and act-plus-plus projects.
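
Of the memory-saving options above, LoRA is the most technique-specific: rather than updating the full weight matrix, it trains a low-rank additive update. The numpy sketch below shows the general mechanism only; DexVLA's actual LoRA fine-tuning of the VLM is configured through its training scripts, and all shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 8  # rank r is much smaller than d_out, d_in

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0
                                           # so W + BA == W at the start

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)  # LoRA forward pass: base output + low-rank update

# Trainable parameter count drops from d_out*d_in to r*(d_out + d_in).
full_params, lora_params = d_out * d_in, r * (d_out + d_in)
print(full_params, lora_params)  # 65536 4096
```

This is why LoRA pairs well with DeepSpeed offload for fitting the 2B-parameter VLM on limited GPU memory: only the small A and B matrices need optimizer state.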

Maintenance & Community

The project is actively updated with recent news regarding training scripts and model releases (e.g., ScaleDP-H). Links to papers for DexVLA, Diffusion-VLA, and ScaleDP are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, its dependencies on LLaVA and act-plus-plus suggest potential licensing considerations for commercial use.

Limitations & Caveats

  • A TypeError related to _batch_encode_plus with an images argument is noted; the workaround involves copying specific JSON files.
  • CUDA OOM issues are common; provided solutions include DeepSpeed offload, LoRA, and smaller ScaleDP models.
  • A bug causing random actions during evaluation has been fixed in a recent commit.
  • Precision overflow in DDIMScheduler can lead to NaN action values during inference, requiring a code modification.
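
The DDIMScheduler overflow caveat is a generic low-precision failure mode: intermediate arithmetic in float16 can overflow to inf and then propagate NaN, while the same computation stays finite in float32. The sketch below demonstrates the mechanism only; it is not the repository's actual patch, which modifies the scheduler code directly.

```python
import numpy as np

# float16 has a maximum representable value of ~65504, so a scheduler
# intermediate near that range overflows to inf, and inf - inf yields NaN,
# which then contaminates every downstream action value.
x16 = np.array([60000.0], dtype=np.float16)
with np.errstate(over="ignore", invalid="ignore"):
    overflowed = x16 * np.float16(2.0)  # 120000 exceeds the float16 range
    diff = overflowed - overflowed      # inf - inf -> nan

print(bool(np.isinf(overflowed[0])))  # True
print(bool(np.isnan(diff[0])))        # True

# The same computation in float32 is well within range and stays finite,
# which is why upcasting the affected scheduler math avoids the NaNs.
x32 = np.float32(60000.0) * np.float32(2.0)
print(bool(np.isfinite(x32)))  # True
```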

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 131 stars in the last 90 days
