juruobenruo
Visuomotor policy learning with plug-in diffusion expert
Top 61.0% on SourcePulse
DexVLA is a framework for visuomotor policy learning in robotics. It combines a Vision-Language Model (VLM) backbone with a plug-in diffusion expert for general robot control. The project targets researchers and engineers working on robotic manipulation and imitation learning.
How It Works
DexVLA leverages the Qwen2-VL-2B model as its VLM backbone, providing robust vision-language understanding without further VLM fine-tuning. For policy learning, it utilizes a diffusion-based expert, specifically ScaleDP, which can be scaled to 1 billion parameters. This modular design allows for flexibility in choosing the diffusion policy head and supports a staged training approach for optimal performance.
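The modular split described above can be illustrated with a minimal, standard-library-only sketch. It is not the DexVLA or ScaleDP code: `ActionHead`, `ToyDenoisingHead`, `toy_backbone`, and `VLAPolicy` are hypothetical names, the backbone is a stand-in function rather than Qwen2-VL, and the "denoising" loop is a simple iterative refinement toward a target rather than a learned diffusion process. It only shows how a policy can accept any head that satisfies a small interface.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


class ActionHead(Protocol):
    """Hypothetical interface that any plug-in policy head would satisfy."""

    def sample(self, condition: Sequence[float], horizon: int) -> List[List[float]]:
        ...


@dataclass
class ToyDenoisingHead:
    """Stand-in for a diffusion expert: refines random noise step by step."""

    action_dim: int
    num_steps: int = 10
    seed: int = 0

    def sample(self, condition: Sequence[float], horizon: int) -> List[List[float]]:
        rng = random.Random(self.seed)
        # A learned network would predict noise; here the "clean" action is
        # simply the mean of the conditioning vector (an assumption for the demo).
        target = sum(condition) / len(condition)
        actions = [
            [rng.gauss(0.0, 1.0) for _ in range(self.action_dim)]
            for _ in range(horizon)
        ]
        for step in range(self.num_steps):
            # Blend factor grows from 1/num_steps to 1, so the final step
            # lands on the target.
            blend = 1.0 / (self.num_steps - step)
            actions = [[a + blend * (target - a) for a in row] for row in actions]
        return actions


def toy_backbone(instruction: str, image_features: Sequence[float]) -> List[float]:
    """Stand-in for a VLM: returns a small conditioning vector."""
    return [float(len(instruction.split()))] + [float(x) for x in image_features]


@dataclass
class VLAPolicy:
    """Backbone output conditions whichever head is plugged in."""

    backbone: Callable[[str, Sequence[float]], List[float]]
    head: ActionHead

    def act(self, instruction: str, image_features: Sequence[float],
            horizon: int = 4) -> List[List[float]]:
        condition = self.backbone(instruction, image_features)
        return self.head.sample(condition, horizon)


policy = VLAPolicy(backbone=toy_backbone, head=ToyDenoisingHead(action_dim=2))
chunk = policy.act("pick up the cup", [0.2, 0.4])
print(len(chunk), len(chunk[0]))
```

Because `VLAPolicy` depends only on the `sample` method, swapping in a different head requires no change to the backbone side, which is the flexibility the modular design is meant to provide.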
Quick Start & Requirements
Create a conda environment (conda create -n dexvla python=3.10), activate it, install the requirements (pip install -r requirements.txt), and install the policy_heads package (cd policy_heads && pip install -e .). For acceleration, install flash-attn.
[...] act project. A conversion script from RLDS to h5py is provided.
[...] config.json with the provided one. Download the ScaleDP-H weights (Stage 1).
scripts/stage2_train.sh and scripts/stage3_train.sh are available. Both require specifying output directories, task names, and paths to pretrained weights.
[...] preprocessor_config.json and chat_template.json.
Highlighted Details
Maintenance & Community
The project is actively updated with recent news regarding training scripts and model releases (e.g., ScaleDP-H). Links to papers for DexVLA, Diffusion-VLA, and ScaleDP are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, its dependencies on LLaVA and act-plus-plus suggest potential licensing considerations for commercial use.
Limitations & Caveats
A TypeError related to _batch_encode_plus with images argument is noted, with a workaround involving copying specific JSON files. CUDA OOM issues are common, with provided solutions including DeepSpeed offload, LoRA, and smaller models. A bug causing random actions during evaluation has been fixed in a recent commit. Precision overflow in DDIMScheduler can lead to NaN action values during inference, requiring a code modification.
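Independently of the scheduler fix mentioned above, a defensive check on predicted actions can catch NaN or infinite values before they reach a robot. The sketch below is a hypothetical helper, not part of the DexVLA repository; the name `validate_actions` and the nested-list action format are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def validate_actions(actions: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return a copy of the action chunk, raising if any value is NaN or infinite."""
    for t, row in enumerate(actions):
        for d, value in enumerate(row):
            if not math.isfinite(value):
                raise ValueError(
                    f"non-finite action at step {t}, dim {d}: {value!r}"
                )
    return [list(row) for row in actions]


# A clean chunk passes through unchanged.
safe = validate_actions([[0.1, -0.2], [0.3, 0.0]])

# An overflow such as inf - inf yields NaN, which the check rejects.
bad_value = float("inf") - float("inf")
try:
    validate_actions([[0.1, bad_value]])
    rejected = False
except ValueError:
    rejected = True
```

A check like this does not remove the underlying precision issue; it only prevents invalid values from being executed silently.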