Visuomotor policy learning with plug-in diffusion expert
DexVLA is a vision-language framework for visuomotor policy learning in robotics. It pairs a Vision-Language Model (VLM) backbone with a plug-in diffusion expert to enable general robot control. The project targets researchers and engineers working on robotic manipulation and imitation learning.
How It Works
DexVLA leverages the Qwen2-VL-2B model as its VLM backbone, providing robust vision-language understanding without further VLM fine-tuning. For policy learning, it utilizes a diffusion-based expert, specifically ScaleDP, which can be scaled to 1 billion parameters. This modular design allows for flexibility in choosing the diffusion policy head and supports a staged training approach for optimal performance.
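To illustrate the modular design, the sketch below wires a frozen VLM backbone to a swappable diffusion action head. The class and method names (`DiffusionActionExpert`, `VisuomotorPolicy`, `denoise_step`, and the callable backbone) are hypothetical placeholders, not DexVLA's actual API; a real ScaleDP head is a transformer denoiser rather than the small MLP used here.

```python
import torch
import torch.nn as nn


class DiffusionActionExpert(nn.Module):
    """Stand-in for a diffusion policy head such as ScaleDP (hypothetical API)."""

    def __init__(self, cond_dim: int, action_dim: int, horizon: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        # A small MLP denoiser stands in for the real transformer-based head.
        self.denoiser = nn.Sequential(
            nn.Linear(cond_dim + horizon * action_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def denoise_step(self, noisy_actions, t, cond):
        flat = noisy_actions.flatten(1)          # (B, horizon * action_dim)
        t_feat = t.float().unsqueeze(-1)         # (B, 1) timestep feature
        out = self.denoiser(torch.cat([cond, flat, t_feat], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)


class VisuomotorPolicy(nn.Module):
    """Frozen VLM backbone + plug-in diffusion expert (sketch only)."""

    def __init__(self, vlm, expert):
        super().__init__()
        self.vlm = vlm        # e.g. a Qwen2-VL-2B encoder, kept frozen
        self.expert = expert  # swappable diffusion policy head

    @torch.no_grad()
    def encode(self, images, instruction):
        # The backbone turns observations + language into a conditioning vector.
        return self.vlm(images, instruction)

    def predict_noise(self, images, instruction, noisy_actions, t):
        cond = self.encode(images, instruction)
        return self.expert.denoise_step(noisy_actions, t, cond)
```

Because the expert only consumes a conditioning vector, different diffusion heads can be plugged in, and the backbone and head can be trained in separate stages, which is the flexibility the paragraph above describes.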
Quick Start & Requirements
- Create a conda environment (`conda create -n dexvla python=3.10`), activate it, install the requirements (`pip install -r requirements.txt`), and install the `policy_heads` package (`cd policy_heads && pip install -e .`). For acceleration, install `flash-attn`.
- Training data follows the format of the `act` project; a conversion script from `rlds` to `h5py` is provided (a sketch of the target layout appears after this list).
- Replace `config.json` with the provided one and download the ScaleDP-H weights (Stage 1).
- Training scripts `scripts/stage2_train.sh` and `scripts/stage3_train.sh` are available; they require specifying output directories, task names, and paths to pretrained weights.
- Copy the provided `preprocessor_config.json` and `chat_template.json` into place.
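The sketch below writes one episode in an ACT-style HDF5 layout (`/observations/qpos`, `/observations/images/<camera>`, `/action`), which is the convention the `act` project uses. The exact keys and shapes DexVLA's conversion script expects may differ, so treat the dataset names and dimensions here as assumptions.

```python
import h5py
import numpy as np


def write_episode(path, qpos, actions, images_by_cam):
    """Write one trajectory to an ACT-style HDF5 file (key layout is an assumption)."""
    with h5py.File(path, "w") as f:
        obs = f.create_group("observations")
        obs.create_dataset("qpos", data=np.asarray(qpos, dtype=np.float32))
        img_grp = obs.create_group("images")
        for cam_name, frames in images_by_cam.items():
            # frames: (T, H, W, 3) uint8; chunk per frame so readers can stream.
            frames = np.asarray(frames, dtype=np.uint8)
            img_grp.create_dataset(cam_name, data=frames,
                                   chunks=(1,) + frames.shape[1:])
        f.create_dataset("action", data=np.asarray(actions, dtype=np.float32))


# Example: a 50-step dummy episode with a single wrist camera.
T = 50
write_episode(
    "episode_0.hdf5",
    qpos=np.zeros((T, 14)),
    actions=np.zeros((T, 14)),
    images_by_cam={"cam_wrist": np.zeros((T, 480, 640, 3), dtype=np.uint8)},
)
```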
Highlighted Details
Maintenance & Community
The project is actively updated with recent news regarding training scripts and model releases (e.g., ScaleDP-H). Links to papers for DexVLA, Diffusion-VLA, and ScaleDP are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, its dependencies on LLaVA and act-plus-plus suggest potential licensing considerations for commercial use.
Limitations & Caveats
A `TypeError` from `_batch_encode_plus` receiving an `images` argument is noted, with a workaround of copying specific JSON files. CUDA OOM issues are common; suggested remedies include DeepSpeed offload, LoRA, and smaller models. A bug causing random actions during evaluation has been fixed in a recent commit. Precision overflow in `DDIMScheduler` can lead to NaN action values during inference and requires a code modification.
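A common mitigation for fp16 overflow in a scheduler is to keep the denoising loop in float32 even when the network runs in half precision. The sketch below shows that pattern with Hugging Face `diffusers`' `DDIMScheduler`; the `denoiser` callable, the timestep counts, and the autocast dtype are assumptions, and this is a generic workaround, not necessarily the exact modification the DexVLA authors apply.

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=100)
scheduler.set_timesteps(num_inference_steps=10)


def sample_actions(denoiser, cond, horizon, action_dim, device="cuda"):
    # Keep the action trajectory and scheduler arithmetic in float32, even if
    # the policy network itself runs in bf16/fp16, to avoid NaN from overflow.
    actions = torch.randn(1, horizon, action_dim, device=device, dtype=torch.float32)
    for t in scheduler.timesteps:
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            noise_pred = denoiser(actions, t, cond)   # hypothetical policy head
        # Cast the prediction back to float32 before the DDIM update step.
        actions = scheduler.step(noise_pred.float(), t, actions).prev_sample
    return actions
```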