starVLA by starVLA

Modular codebase for developing Vision-Language-Action models

Created 3 months ago

741 stars

Top 46.9% on SourcePulse

Project Summary

Summary

StarVLA is a modular, flexible codebase for developing Vision-Language-Action (VLA) models. It targets researchers and engineers needing rapid prototyping and plug-and-play integration of VLA frameworks, offering a "Lego-like" architecture for swift iteration.

How It Works

Components (model, data, trainer) follow top-down separation with high cohesion and low coupling for easy testing and swapping. StarVLA supports multiple VLA frameworks: Qwen-FAST (autoregressive discrete actions), Qwen-OFT (parallel continuous actions), Qwen-PI (diffusion-based continuous actions), and Qwen-GR00T (dual-system VLA).
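As a rough illustration of the plug-and-play idea, the sketch below shows one common way to make frameworks swappable by name: a small registry. It is not starVLA's actual API. The names FRAMEWORKS, register, build_framework, and the two placeholder classes are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a framework name to its class.
# This illustrates the "swap components by name" pattern only;
# it does not reflect how starVLA is implemented.
FRAMEWORKS: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records a framework class under `name`."""
    def decorator(cls: type) -> type:
        if name in FRAMEWORKS:
            raise ValueError(f"framework {name!r} is already registered")
        FRAMEWORKS[name] = cls
        return cls
    return decorator


@register("Qwen-FAST")
class QwenFAST:
    # Placeholder class: stands in for an autoregressive discrete-action model.
    action_head = "autoregressive discrete"


@register("Qwen-OFT")
class QwenOFT:
    # Placeholder class: stands in for a parallel continuous-action model.
    action_head = "parallel continuous"


def build_framework(name: str):
    """Instantiate a registered framework, with a readable error if unknown."""
    if name not in FRAMEWORKS:
        raise KeyError(f"unknown framework {name!r}; available: {sorted(FRAMEWORKS)}")
    return FRAMEWORKS[name]()


if __name__ == "__main__":
    model = build_framework("Qwen-OFT")
    print(type(model).__name__, "->", model.action_head)
```

With a layout like this, a trainer only needs the framework name from its config, so swapping one framework for another touches neither the data nor the training code.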

Quick Start & Requirements

Setup consists of cloning the repository, creating a Python 3.10 conda environment, installing the dependencies listed in requirements.txt, installing FlashAttention2 (flash-attn with --no-build-isolation), and installing the package itself in editable mode (pip install -e .). FlashAttention2 requires the system CUDA toolkit and PyTorch versions to be strictly aligned. To quickly check the installation, run python starVLA/model/framework/QwenGR00T.py. Links to the Hugging Face models and the SimplerEnv docs are available.
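Before building FlashAttention2, it can help to compare the CUDA toolkit on the system with the CUDA version PyTorch was built against. The script below is a minimal sketch of such a check and is not part of starVLA. The helper names are hypothetical, and treating "aligned" as "same major.minor version" is an assumption of this example, not a rule stated by the project.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_aligned(toolkit_version, torch_cuda_version):
    """Assumption: 'aligned' means the major.minor components are equal."""
    if not toolkit_version or not torch_cuda_version:
        return False
    return toolkit_version.split(".")[:2] == torch_cuda_version.split(".")[:2]


def system_toolkit_version():
    """Return the CUDA toolkit version reported by nvcc, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    toolkit = system_toolkit_version()
    print(f"system CUDA toolkit: {toolkit}")
    print(f"PyTorch built with CUDA: {torch_cuda}")
    print("aligned" if versions_aligned(toolkit, torch_cuda) else "mismatch or not detected")
```

If the script reports a mismatch, installing a PyTorch build that matches the system toolkit (or the reverse) before running the flash-attn install is usually the first thing to try.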

Highlighted Details

VLA Frameworks: Implements Qwen-FAST, Qwen-OFT, Qwen-PI, Qwen-GR00T using Qwen2.5-VL/Qwen3-VL backbones.
Model Zoo: Pretrained checkpoints available on Hugging Face.
Simulation Benchmarks: Supports SimplerEnv and LIBERO; Robocasa, RLBench, and others are planned.
Training Strategies: Includes Imitation Learning and Multitask Co-training; RL adaptation is upcoming.
Usability: Enables rapid framework prototyping (<3 hours for internal devs, <1 day for new users).

Maintenance & Community

The project incorporates community feedback and encourages contributions via Issues, Discussions, and PRs. A "Cooperation Form" and weekly Friday office hours facilitate collaboration. The codebase is forked from InternVLA-M1 and references LeRobot, GR00T, DeepSpeed, and Qwen-VL.

Licensing & Compatibility

Released under the MIT License, permitting commercial use, modification, and distribution.

Limitations & Caveats

Several simulation benchmarks and the RL adaptation training strategy are marked "coming soon." FlashAttention2 installation demands careful CUDA/PyTorch version matching. Training resumption does not save optimizer states, which makes restarting an interrupted run less efficient.

Explore Similar Projects

Awesome-VLA-RL by OpenHelix-Team

Large-VLM-based-VLA-for-Robotic-Manipulation by JiuTian-VL

Awesome-VLA-Papers by Psi-Robot

RoboVLMs by Robot-VLAs

vla0 by NVlabs

CogACT by microsoft

OpenDriveVLA by DriveVLA

Agent-R1 by 0russwest0

ShowUI by showlab

VLA-Adapter by OpenHelix-Team

R1-V by StarsfieldAI

Isaac-GR00T by NVIDIA