starVLA by starVLA

Modular codebase for developing Vision-Language-Action models

Created 3 weeks ago

357 stars

Top 78.2% on SourcePulse

Summary

StarVLA is a modular, flexible codebase for developing Vision-Language-Action (VLA) models. It targets researchers and engineers who need rapid prototyping and plug-and-play integration of VLA frameworks, offering a "Lego-like" architecture for fast iteration.

How It Works

The model, data, and trainer components follow a top-down separation with high cohesion and low coupling, so each can be tested and swapped independently (see the sketch below). StarVLA supports multiple VLA frameworks: Qwen-FAST (autoregressive discrete actions), Qwen-OFT (parallel continuous actions), Qwen-PI (diffusion-based continuous actions), and Qwen-GR00T (dual-system VLA).
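To make the "Lego-like" idea concrete, here is a minimal hypothetical sketch: the names below (VLAModel, FastHead, OftHead, build) are illustrative inventions, not the actual starVLA API.

```python
# Hypothetical sketch of a "Lego-like" VLA split -- these names are
# illustrative inventions, NOT the actual starVLA API.
from dataclasses import dataclass

class FastHead:
    """Stand-in for an autoregressive discrete-action head (Qwen-FAST style)."""
    def predict(self, feats):
        return ["<act_12>", "<act_7>"]  # discrete action tokens

class OftHead:
    """Stand-in for a parallel continuous-action head (Qwen-OFT style)."""
    def predict(self, feats):
        return [0.10, -0.25, 0.03]  # continuous action vector

@dataclass
class VLAModel:
    backbone: str   # placeholder for a Qwen2.5-VL / Qwen3-VL encoder
    head: object    # the only brick that changes between frameworks

def build(framework: str) -> VLAModel:
    # Swapping a framework means swapping one component; the data
    # pipeline and trainer never need to know which head is in use.
    heads = {"qwen-fast": FastHead(), "qwen-oft": OftHead()}
    return VLAModel(backbone="qwen2.5-vl", head=heads[framework])

model = build("qwen-oft")
print(model.head.predict(feats=None))
```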

Quick Start & Requirements

Setup involves cloning the repository, creating a Python 3.10 conda environment, installing the requirements (requirements.txt), installing FlashAttention2 (pip install flash-attn --no-build-isolation), and installing the package itself (pip install -e .). Crucially, FlashAttention2 requires the system CUDA toolkit to match the version PyTorch was built against. A quick sanity-check command is provided: python starVLA/model/framework/QwenGR00T.py. Links to Hugging Face models and SimplerEnv docs are available.
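Condensed into shell commands, the steps read roughly as follows; the repository URL and environment name are assumptions, and the flash-attn wheel must match your CUDA/PyTorch pairing.

```bash
# Repository URL and env name are assumed -- substitute your own.
git clone https://github.com/starVLA/starVLA.git
cd starVLA
conda create -n starvla python=3.10 -y
conda activate starvla
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # CUDA toolkit must match the PyTorch build
pip install -e .
python starVLA/model/framework/QwenGR00T.py  # quick check from the docs
```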

Highlighted Details

  • VLA Frameworks: Implements Qwen-FAST, Qwen-OFT, Qwen-PI, Qwen-GR00T using Qwen2.5-VL/Qwen3-VL backbones.
  • Model Zoo: Pretrained checkpoints available on Hugging Face.
  • Simulation Benchmarks: Supports SimplerEnv and LIBERO; RoboCasa, RLBench, etc., are planned.
  • Training Strategies: Includes Imitation Learning, Multitask Co-training; RL adaptation is upcoming.
  • Usability: New frameworks can be prototyped rapidly (under 3 hours for internal developers, under 1 day for new users).

Maintenance & Community

The project incorporates community feedback and encourages contributions via Issues, Discussions, and PRs. A "Cooperation Form" and weekly Friday office hours facilitate collaboration. The codebase is forked from InternVLA-M1 and draws on LeRobot, GR00T, DeepSpeed, and Qwen-VL.

Licensing & Compatibility

Released under the MIT License, permitting commercial use, modification, and distribution.

Limitations & Caveats

Several simulation benchmarks and the RL-adaptation training strategy are marked "coming soon." FlashAttention2 installation demands careful CUDA/PyTorch version matching. Checkpoints do not include optimizer state, so resumed training restarts the optimizer from scratch, which hurts restart efficiency.
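For context, the standard PyTorch pattern bundles optimizer state into the checkpoint so a resumed run continues exactly where it left off; the sketch below is generic PyTorch, not starVLA code.

```python
# Generic PyTorch checkpointing sketch (illustrative, not starVLA code):
# saving optimizer state alongside the weights is what makes resumption exact.
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Save: include the optimizer state (momentum/variance buffers) and a step counter.
torch.save({"model": model.state_dict(),
            "optim": opt.state_dict(),
            "step": 1000}, "ckpt.pt")

# Resume: restoring both avoids the warm-up hit of reset optimizer moments.
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["optim"])
step = ckpt["step"]
```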

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 26
  • Star History: 361 stars in the last 26 days
