vla-scratch by EGalahad

Performant stack for Vision-Language-Action model training and serving

Created 1 month ago

297 stars

Top 89.7% on SourcePulse

Project Summary

VLA-Scratch offers a modular, performant, and efficient stack for training, evaluating, and serving Vision-Language-Action (VLA) models. It targets engineers and researchers, aiming to make VLA model development fast and approachable by minimizing dependencies and optimizing performance.

How It Works

The stack uses TensorClass to define explicit data boundaries, which keeps the codebase typed and modular, makes data flow clear, and supports co-training on heterogeneous datasets. Performance comes from optimizing VLM forward passes to eliminate host-device syncs and from tuning directly against native PyTorch features such as FSDP2 and gradient checkpointing. Experimentation runs through a Hydra workflow: configurations can be registered and overridden with the same grammar across the training, evaluation, and serving scripts.
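To make the "explicit data boundaries" idea concrete, here is a minimal sketch of how a typed sample container can let robotics and VQA data share one batch. It is not the project's code: plain dataclasses and Python lists stand in for TensorClass and tensors, and every class and field name (`Observation`, `Sample`, `actions`, `answer`) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """Inputs shared by every data source (hypothetical field names)."""
    image: List[float]  # stand-in for an image tensor
    prompt: str


@dataclass(frozen=True)
class Sample:
    """One training sample; optional fields mark what a source provides."""
    obs: Observation
    actions: Optional[List[float]] = None  # present for robotics data only
    answer: Optional[str] = None           # present for VQA data only


def supervision_keys(sample: Sample) -> List[str]:
    """Return the names of the targets this sample can supervise."""
    return [
        f.name
        for f in fields(sample)
        if f.name != "obs" and getattr(sample, f.name) is not None
    ]


# Two samples from different sources, built against the same typed schema.
robot = Sample(Observation([0.1, 0.2], "pick up the cup"), actions=[0.0, 1.0])
vqa = Sample(Observation([0.3, 0.4], "what color is the cup?"), answer="red")

# A mixed batch: each sample declares which loss terms apply to it.
mixed_batch = [robot, vqa]
routing = [supervision_keys(s) for s in mixed_batch]
print(routing)  # [['actions'], ['answer']]
```

Because the schema states which fields each source may omit, downstream modules can decide per sample which targets to train on instead of guessing from untyped dictionaries.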

Quick Start & Requirements

Environment setup uses uv: GIT_LFS_SKIP_SMUDGE=1 uv sync. Commands are provided for training (uv run torchrun ... scripts/train_policy.py), evaluation (uv run scripts/eval_policy.py), and serving (uv run scripts/serve_policy.py). The workflow relies on torchrun (PyTorch's distributed launcher), wandb, and pyav. A note warns that stable PyTorch may have compatibility issues with RTX 5090 GPUs on CUDA 12.8 and recommends PyTorch-Nightly in that case. Further details are in scripts/README.md and examples/libero.

Highlighted Details

Explicit TensorClass data model enables composable modules and heterogeneous dataset co-training.
Dedicated performance tuning built directly on native PyTorch features (FSDP2, gradient checkpointing) rather than on generic training libraries.
Supports multi-source dataset co-training (VQA, robotics) and multiple VLM backbones (Qwen3-VL, PaliGemma 1/2, SmolVLM).
Includes simulation-ready checkpoints and serving scripts.
Hydra workflow simplifies experimentation with easily overridable configurations.
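The configuration-override idea in the last item can be illustrated with a small standard-library sketch. This is not Hydra and not the project's configuration code: it only mimics the basic `key.subkey=value` form of command-line overrides, and the config keys (`policy.name`, `policy.lr`, `batch_size`) and helper names are hypothetical.

```python
import copy
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a raw override string as bool, int, float, or str."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of config with 'a.b=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Shared defaults (hypothetical keys), reused by different entry points.
defaults = {"policy": {"name": "example_policy", "lr": 1e-4}, "batch_size": 8}

# The same override syntax serves a training run and a serving run.
train_cfg = apply_overrides(defaults, ["policy.lr=3e-5", "batch_size=16"])
serve_cfg = apply_overrides(defaults, ["policy.name=other_policy"])

print(train_cfg)
print(serve_cfg)
```

Sharing one override syntax across entry points means a setting changed for training can be repeated verbatim when evaluating or serving the resulting policy. Hydra's real override grammar is considerably richer than this sketch.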

Maintenance & Community

No specific details regarding contributors, community channels, or roadmaps are present in the provided README.

Licensing & Compatibility

The README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.

Limitations & Caveats

Users with RTX 5090 GPUs and CUDA 12.8 may face issues with stable PyTorch, requiring PyTorch-Nightly. The troubleshooting section is marked "To be Continued." The absence of a specified license is a significant adoption caveat.
