starVLA: Modular codebase for developing Vision-Language-Action models
Top 78.2% on SourcePulse
Summary
StarVLA is a modular, flexible codebase for developing Vision-Language-Action (VLA) models. It targets researchers and engineers needing rapid prototyping and plug-and-play integration of VLA frameworks, offering a "Lego-like" architecture for swift iteration.
How It Works
Components (model, data, trainer) follow a top-down separation with high cohesion and low coupling, so each part can be tested and swapped independently. StarVLA supports multiple VLA frameworks: Qwen-FAST (autoregressive discrete actions), Qwen-OFT (parallel continuous actions), Qwen-PI (diffusion-based continuous actions), and Qwen-GR00T (dual-system VLA).
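The plug-and-play idea can be illustrated with a minimal sketch. The registry, class names, and constructor below are hypothetical and only show the high-cohesion/low-coupling layout, not starVLA's actual API.

```python
# Hypothetical sketch of a "Lego-like" layout: each framework is a self-contained
# module selected by name, so swapping action heads never touches data or trainer code.
# Class and registry names are illustrative, not starVLA's actual API.
from dataclasses import dataclass


@dataclass
class VLAConfig:
    framework: str = "qwen_oft"   # e.g. qwen_fast | qwen_oft | qwen_pi | qwen_groot
    action_dim: int = 7


class QwenFAST:
    """Autoregressive discrete-action framework (illustrative stub)."""
    def __init__(self, cfg: VLAConfig):
        self.cfg = cfg


class QwenOFT:
    """Parallel continuous-action framework (illustrative stub)."""
    def __init__(self, cfg: VLAConfig):
        self.cfg = cfg


FRAMEWORKS = {"qwen_fast": QwenFAST, "qwen_oft": QwenOFT}


def build_model(cfg: VLAConfig):
    # Low coupling: the trainer depends only on this factory, so adding a new
    # framework means registering one more entry in FRAMEWORKS.
    return FRAMEWORKS[cfg.framework](cfg)


model = build_model(VLAConfig(framework="qwen_fast"))
print(type(model).__name__)
```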
Quick Start & Requirements
Setup involves cloning the repository, creating a Python 3.10 conda environment, installing the requirements (pip install -r requirements.txt), FlashAttention2 (pip install flash-attn --no-build-isolation), and the package itself (pip install -e .). Crucially, FlashAttention2 requires strict alignment between the system CUDA toolkit and the installed PyTorch build. A quick sanity-check command is provided: python starVLA/model/framework/QwenGR00T.py. Links to Hugging Face models and SimplerEnv docs are available.
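Condensed, the setup described above looks roughly like the following; the repository URL and environment name are placeholders.

```bash
# Setup sketch based on the steps above; repo URL and env name are placeholders.
git clone https://github.com/<org>/starVLA.git && cd starVLA
conda create -n starvla python=3.10 -y
conda activate starvla
pip install -r requirements.txt
# FlashAttention2: the system CUDA toolkit must match the installed PyTorch build.
pip install flash-attn --no-build-isolation
pip install -e .
# Quick sanity check of the Qwen-GR00T framework wiring.
python starVLA/model/framework/QwenGR00T.py
```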
Maintenance & Community
The project incorporates community feedback and encourages contributions via Issues, Discussions, and PRs. A "Cooperation Form" and weekly Friday office hours facilitate collaboration. The codebase is forked from InternVLA-M1, referencing LeRobot, GR00T, DeepSpeed, and Qwen-VL.
Licensing & Compatibility
Released under the MIT License, permitting commercial use, modification, and distribution.
Limitations & Caveats
Several simulation benchmarks and the RL adaptation training strategy are marked "coming soon." FlashAttention2 installation demands careful CUDA/PyTorch version matching. Training resumption does not save optimizer states, impacting restart efficiency.
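As a workaround for the resumption caveat, a checkpoint that also stores optimizer and scheduler state allows a clean restart. The following is a generic PyTorch-style sketch, not starVLA's actual checkpoint format.

```python
# Generic PyTorch checkpointing sketch (not starVLA's format): saving optimizer and
# scheduler state alongside model weights enables exact training resumption.
import torch


def save_checkpoint(path, model, optimizer, scheduler, step):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # Adam moments etc. needed for a clean restart
        "scheduler": scheduler.state_dict(),
    }, path)


def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["step"]
```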