DeepThinkVLA by OpenBMB

Vision-Language-Action models enhanced with explicit deliberation

Created 3 months ago
476 stars

Top 64.4% on SourcePulse

Project Summary

DeepThinkVLA enhances Vision-Language-Action (VLA) models by introducing explicit deliberation through a hybrid decoder and Chain-of-Thought (CoT) reasoning. It targets researchers and engineers in embodied AI, offering improved task success rates, efficient inference, and self-correction capabilities for complex robotic manipulation tasks.

How It Works

The project refactors VLA policies into a 2.9B-parameter hybrid decoder that first generates an autoregressive reasoning trace and then emits action chunks in parallel. This architecture cleanly separates deliberation from action, resolving the modality conflicts inherent in single-decoder baselines. A two-stage CoT dataset engine and a training pipeline combining supervised fine-tuning (SFT) with outcome-driven reinforcement learning (RL) further refine performance. The approach yields significant gains in success rate and reduces inference latency compared to naive autoregressive methods.
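
The two-phase design can be made concrete with a minimal PyTorch sketch. Everything below (the class name, layer sizes, and the learned action-query mechanism) is an illustrative assumption, not the repository's actual 2.9B architecture or API:

```python
import torch
import torch.nn as nn

class HybridDecoderSketch(nn.Module):
    """Toy two-phase decoder: an autoregressive reasoning (CoT) phase,
    then one parallel pass that emits a whole action chunk."""

    def __init__(self, vocab=1000, d_model=64, chunk=8, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)            # reasoning tokens
        self.action_queries = nn.Parameter(torch.randn(chunk, d_model))
        self.action_head = nn.Linear(d_model, action_dim)   # continuous actions

    @torch.no_grad()
    def forward(self, prompt_ids, max_cot=16, use_cot=True):
        ids = prompt_ids
        if use_cot:  # phase 1: token-by-token deliberation (causal, greedy)
            for _ in range(max_cot):
                mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
                h = self.backbone(self.embed(ids), mask=mask)
                next_id = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=1)
        # phase 2: every action slot is decoded in a single parallel pass;
        # learned queries attend to the prompt and any reasoning trace
        q = self.action_queries.expand(ids.size(0), -1, -1)
        h = self.backbone(torch.cat([self.embed(ids), q], dim=1))
        return self.action_head(h[:, -q.size(1):])          # (B, chunk, action_dim)

model = HybridDecoderSketch()
actions = model(torch.randint(0, 1000, (1, 5)))  # 5-token dummy prompt
print(actions.shape)                             # torch.Size([1, 8, 7])
```

The split matters because the reasoning trace is inherently sequential, while the action chunk has a fixed length and known slots, so every slot can be decoded in one forward pass.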

Quick Start & Requirements

Installation involves creating a Conda environment (python=3.10), activating it, and running pip install -r requirements.txt; a condensed sketch of these steps follows. A known fix for egl_probe requires installing cmake==3.31.6 and a patched wheel. Prerequisites include Linux/WSL, NVIDIA GPUs with CUDA 12.x, and Python >= 3.10. Full SFT training demands substantial resources (at least 8x 80GB GPUs), while RL requires a multi-node setup. Data, checkpoints, and training scripts are distributed via Hugging Face CLI commands and shell scripts.
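
A condensed sketch of the described workflow (the environment name, the patched-wheel location, and the Hugging Face repo ID are placeholders; consult the repository's README for the exact commands):

```bash
# Create and activate the Conda environment (Python 3.10, per the README);
# "deepthinkvla" is a placeholder environment name
conda create -n deepthinkvla python=3.10 -y
conda activate deepthinkvla

# Install project dependencies
pip install -r requirements.txt

# Workaround for the egl_probe build failure: pin cmake, then install
# the patched wheel the README links to (location omitted here)
pip install cmake==3.31.6

# Data and checkpoints ship via the Hugging Face CLI; the repo ID below
# is a placeholder, not the project's real one
# huggingface-cli download <org>/<checkpoint-repo>
```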

Highlighted Details

  • Achieves a 97.0% average success rate on the LIBERO benchmark, surpassing autoregressive, diffusion, and parallel-decoding baselines.
  • The hybrid decoder architecture alone improves success rates by 15.5 percentage points over naive autoregressive CoT variants.
  • Outcome-based RL refinement boosts performance on long-horizon tasks, increasing LIBERO-Long success from 94.2% to 96.2%.
  • Masked-CoT inference maintains high accuracy (96.5% average SR) while operating at 0.175x the latency of the pi0-FAST baseline (see the sketch after this list).
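
One plausible reading of the masked-CoT mode, expressed as a continuation of the toy HybridDecoderSketch from the How It Works section (again a hypothetical sketch, not the repository's actual API): at inference the deliberation loop is skipped, so latency is dominated by the single parallel action pass.

```python
# Hypothetical masked-CoT inference with the HybridDecoderSketch above:
# bypass the autoregressive deliberation loop entirely and pay only
# for the one parallel action pass.
fast_actions = model(torch.randint(0, 1000, (1, 5)), use_cot=False)
print(fast_actions.shape)  # torch.Size([1, 8, 7]), same action chunk
```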

Maintenance & Community

The project acknowledges contributions from open-source components and communities including Hugging Face Transformers, PEFT, DeepSpeed, LeRobot, LIBERO, and VERL. No community channels (such as Discord or Slack) are listed, and no maintainer information is given beyond the author list in the citation.

Licensing & Compatibility

The repository includes a LICENSE file, but its specific terms and compatibility for commercial use are not detailed within the provided README content.

Limitations & Caveats

Future work includes integrating the RobotWin benchmark and conducting real-world hardware experiments. The setup is resource-intensive, particularly for full SFT training, and may require specific dependency workarounds. The project appears to be under active development, with several items listed under TODO.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 16 stars in the last 30 days
