OpenBMB
Vision-Language-Action models enhanced with explicit deliberation
DeepThinkVLA enhances Vision-Language-Action (VLA) models by introducing explicit deliberation through a hybrid decoder and Chain-of-Thought (CoT) reasoning. It targets researchers and engineers in embodied AI, offering improved task success rates, efficient inference, and self-correction capabilities for complex robotic manipulation tasks.
How It Works
The project refactors VLA policies into a 2.9B-parameter hybrid decoder that first generates a reasoning trace autoregressively and then emits action chunks in parallel. This architecture separates deliberation from action, resolving the modality conflict that arises when a single decoder handles both. A two-stage CoT dataset engine and a training pipeline that combines supervised fine-tuning (SFT) with outcome-driven reinforcement learning (RL) further refine performance. According to the README, the approach raises task success rates and lowers inference latency relative to naive, fully autoregressive decoding.
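The split between sequential reasoning and parallel action emission can be illustrated with a toy attention mask and a count of decoder passes. This is a minimal sketch under stated assumptions, not the project's implementation: the function names `hybrid_attention_mask` and `decoding_passes`, the token counts, and the choice of fully bidirectional attention inside the action chunk are all hypothetical.

```python
def hybrid_attention_mask(n_reason: int, n_action: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0 .. n_reason-1 are reasoning tokens and use causal attention.
    Positions n_reason .. are action tokens; this sketch ASSUMES they attend
    to the whole reasoning trace and to every other action token.
    """
    total = n_reason + n_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_reason:
                # Reasoning token: sees itself and earlier positions only.
                mask[i][j] = j <= i
            else:
                # Action token (assumed): sees all positions.
                mask[i][j] = True
    return mask


def decoding_passes(n_reason: int, n_action: int, parallel_actions: bool) -> int:
    """Count decoder forward passes needed to produce the full output.

    Autoregressive decoding costs one pass per generated token. Parallel
    action decoding (as assumed here) emits the whole chunk in one pass.
    """
    if parallel_actions:
        return n_reason + (1 if n_action > 0 else 0)
    return n_reason + n_action


if __name__ == "__main__":
    for row in hybrid_attention_mask(2, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(decoding_passes(20, 10, parallel_actions=True))
    print(decoding_passes(20, 10, parallel_actions=False))
```

Under these assumptions, a trace of 20 reasoning tokens followed by 10 action tokens needs 21 passes with parallel action decoding and 30 passes when every token is generated one at a time, which is the kind of saving the latency claim refers to.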
Quick Start & Requirements
Installation consists of creating a Conda environment (python=3.10), activating it, and running pip install -r requirements.txt. The egl_probe dependency needs a workaround: install cmake==3.31.6 and then a patched wheel. Prerequisites are Linux or WSL, NVIDIA GPUs with CUDA 12.x, and Python >= 3.10. Full SFT training demands substantial resources (>= 8x80GB GPUs), and RL requires a multi-node setup. Data, checkpoints, and training scripts are made available through Hugging Face CLI commands and shell scripts.
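Before installing, the listed prerequisites can be checked with a short script. The sketch below uses only the Python standard library; the function name `find_problems` is hypothetical, and the presence of `nvidia-smi` on the PATH is used as a rough stand-in for "an NVIDIA GPU is available". It does not verify the CUDA version or the amount of GPU memory.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the stated prerequisite "Python >= 3.10"


def find_problems(python_version, system, has_nvidia_smi):
    """Return a list of human-readable prerequisite problems (empty if none).

    python_version: a tuple such as sys.version_info
    system: the value of platform.system(); WSL also reports "Linux"
    has_nvidia_smi: whether the nvidia-smi executable was found on PATH
    """
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python >= 3.10 is required")
    if system != "Linux":
        problems.append("Linux or WSL is required")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found; an NVIDIA GPU with CUDA 12.x is required")
    return problems


if __name__ == "__main__":
    issues = find_problems(
        sys.version_info,
        platform.system(),
        shutil.which("nvidia-smi") is not None,
    )
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        print("Basic prerequisites look satisfied.")
```

Keeping the checks in a pure function that takes its inputs as arguments makes the logic easy to test without a GPU machine; only the `__main__` block touches the real environment.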
Highlighted Details
Maintenance & Community
The project acknowledges contributions from various open-source components and communities, including Hugging Face Transformers, PEFT, DeepSpeed, LeRobot, LIBERO, and VERL. No direct community channels (like Discord or Slack) or explicit maintainer information beyond the author list in the citation are provided.
Licensing & Compatibility
The repository includes a LICENSE file, but its specific terms and compatibility for commercial use are not detailed within the provided README content.
Limitations & Caveats
Future work includes integrating the RoboTwin benchmark and conducting real-world hardware experiments. The setup is resource-intensive, particularly for full SFT training, and may require specific dependency workarounds. The project appears to be under active development, with several items still listed under TODO.