starVLA  by starVLA

Modular codebase for developing Vision-Language-Action models

Created 7 months ago
2,576 stars

Top 17.6% on SourcePulse

Project Summary

Summary

StarVLA is a modular, flexible codebase for developing Vision-Language-Action (VLA) models. It targets researchers and engineers needing rapid prototyping and plug-and-play integration of VLA frameworks, offering a "Lego-like" architecture for swift iteration.

How It Works

Components (model, data, trainer) follow top-down separation with high cohesion and low coupling for easy testing and swapping. StarVLA supports multiple VLA frameworks: Qwen-FAST (autoregressive discrete actions), Qwen-OFT (parallel continuous actions), Qwen-PI (diffusion-based continuous actions), and Qwen-GR00T (dual-system VLA).
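One way to picture the plug-and-play idea is a small registry that maps a framework name to a factory, so a trainer can swap frameworks by name. The sketch below is purely illustrative: `FrameworkSpec`, `FRAMEWORK_REGISTRY`, `register`, and `build_framework` are hypothetical names, not starVLA's actual API, and the only facts taken from the summary are the four framework names and their action styles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class FrameworkSpec:
    """Hypothetical description of a VLA framework (not a starVLA class)."""
    name: str
    decoding: str                 # how actions are produced
    action_space: Optional[str]   # None where the summary does not say


# Hypothetical registry: framework name -> factory returning its spec.
FRAMEWORK_REGISTRY: Dict[str, Callable[[], FrameworkSpec]] = {}


def register(name: str):
    """Decorator that adds a factory to the registry under `name`."""
    def decorator(factory: Callable[[], FrameworkSpec]):
        FRAMEWORK_REGISTRY[name] = factory
        return factory
    return decorator


@register("Qwen-FAST")
def qwen_fast() -> FrameworkSpec:
    return FrameworkSpec("Qwen-FAST", "autoregressive", "discrete")


@register("Qwen-OFT")
def qwen_oft() -> FrameworkSpec:
    return FrameworkSpec("Qwen-OFT", "parallel", "continuous")


@register("Qwen-PI")
def qwen_pi() -> FrameworkSpec:
    return FrameworkSpec("Qwen-PI", "diffusion", "continuous")


@register("Qwen-GR00T")
def qwen_gr00t() -> FrameworkSpec:
    # The summary only says "dual-system VLA"; the action space is left unset.
    return FrameworkSpec("Qwen-GR00T", "dual-system", None)


def build_framework(name: str) -> FrameworkSpec:
    """Look up a framework by name; swapping frameworks is a one-string change."""
    try:
        return FRAMEWORK_REGISTRY[name]()
    except KeyError:
        available = ", ".join(sorted(FRAMEWORK_REGISTRY))
        raise KeyError(f"Unknown framework {name!r}; available: {available}")
```

Under this assumed design, adding a new framework means writing one decorated factory, with no change to the code that calls `build_framework`.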

Quick Start & Requirements

Setup involves cloning the repository, creating a Python 3.10 conda environment, installing the dependencies from requirements.txt, installing FlashAttention2 (flash-attn --no-build-isolation), and installing the package itself in editable mode (pip install -e .). FlashAttention2 requires the system CUDA toolkit and the installed PyTorch build to be strictly version-aligned. To check the installation, run python starVLA/model/framework/QwenGR00T.py. The project also links to its Hugging Face models and the SimplerEnv docs.
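Before building FlashAttention2, it can help to compare the CUDA version reported by the toolkit with the one PyTorch was built for. The following is a minimal sketch, not part of starVLA: the helper names are hypothetical, and treating "aligned" as an exact major.minor match is an assumption about what the strict alignment means in practice.

```python
import re
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse_nvcc_release(nvcc_output: str) -> Optional[Version]:
    """Extract (major, minor) from `nvcc --version` text such as 'release 12.1,'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_cuda_version(version: Optional[str]) -> Optional[Version]:
    """Parse a string like '12.1', the form reported by torch.version.cuda."""
    if not version:
        return None  # CPU-only PyTorch builds report None
    parts = version.split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def versions_aligned(torch_cuda: Optional[str], nvcc_output: str) -> bool:
    """Assumption: 'aligned' means identical major.minor on both sides."""
    torch_version = parse_cuda_version(torch_cuda)
    return torch_version is not None and torch_version == parse_nvcc_release(nvcc_output)


def check_local_environment() -> bool:
    """Compare the local PyTorch build with the local nvcc; needs both installed."""
    import torch  # imported lazily so the parsers work without PyTorch

    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        print("nvcc not found on PATH")
        return False
    aligned = versions_aligned(torch.version.cuda, result.stdout)
    print(f"torch CUDA: {torch.version.cuda}, aligned with nvcc: {aligned}")
    return aligned
```

Calling `check_local_environment()` inside the conda environment prints both versions; a mismatch suggests fixing the toolkit or the PyTorch install before attempting the flash-attn build.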

Highlighted Details

  • VLA Frameworks: Implements Qwen-FAST, Qwen-OFT, Qwen-PI, Qwen-GR00T using Qwen2.5-VL/Qwen3-VL backbones.
  • Model Zoo: Pretrained checkpoints available on Hugging Face.
  • Simulation Benchmarks: Supports SimplerEnv and LIBERO; Robocasa, RLBench, and others are planned.
  • Training Strategies: Includes Imitation Learning, Multitask Co-training; RL adaptation is upcoming.
  • Usability: Enables rapid framework prototyping (<3 hours for internal devs, <1 day for new users).

Maintenance & Community

The project incorporates community feedback and encourages contributions via Issues, Discussions, and PRs. A "Cooperation Form" and weekly Friday office hours facilitate collaboration. The codebase is forked from InternVLA-M1 and references LeRobot, GR00T, DeepSpeed, and Qwen-VL.

Licensing & Compatibility

Released under the MIT License, permitting commercial use, modification, and distribution.

Limitations & Caveats

Several simulation benchmarks and the RL adaptation training strategy are marked "coming soon." Installing FlashAttention2 demands careful CUDA/PyTorch version matching. Checkpoints do not include optimizer states, so a resumed training run starts with a fresh optimizer, which makes restarts less efficient.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 32
  • Issues (30d): 27
  • Star History: 502 stars in the last 30 days
