vla-scratch by EGalahad

Performant stack for Vision-Language-Action model training and serving

Created 1 month ago
297 stars

Top 89.7% on SourcePulse

Summary

VLA-Scratch offers a modular, performant, and efficient stack for training, evaluating, and serving Vision-Language-Action (VLA) models. It targets engineers and researchers, aiming to make VLA model development fast and approachable by minimizing dependencies and optimizing performance.

How It Works

The stack uses TensorClass to define explicit data boundaries, giving a typed, modular codebase in which data flow is easy to follow and heterogeneous datasets can be co-trained. Performance comes from two sources: VLM forward passes are optimized to eliminate host-device syncs, and training relies on native PyTorch features such as FSDP2 and gradient checkpointing, which can be tuned specifically for this stack. Experimentation runs through a Hydra workflow in which configurations are registered and overridden with the same grammar across the training, evaluation, and serving scripts.
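To make the idea of explicit data boundaries concrete, here is a minimal sketch that uses only Python's standard-library dataclasses as a stand-in for TensorClass. The class names (Observation, Sample), the field names, and the collate helper are hypothetical and are not taken from the VLA-Scratch codebase; plain lists stand in for tensors.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    # Hypothetical fields; a list of floats stands in for an image tensor.
    image: List[float]
    instruction: str


@dataclass(frozen=True)
class Sample:
    observation: Observation
    # Robotics samples carry actions; VQA samples carry an answer instead.
    actions: Optional[List[float]] = None
    answer: Optional[str] = None


def collate(samples: List[Sample]) -> dict:
    """Group samples field by field, keeping absent fields as None."""
    batch = {}
    for f in fields(Sample):
        batch[f.name] = [getattr(s, f.name) for s in samples]
    return batch


robot = Sample(Observation([0.1, 0.2], "pick up the cup"), actions=[0.0, 1.0])
vqa = Sample(Observation([0.3, 0.4], "what color is the cup?"), answer="red")
batch = collate([robot, vqa])
print(batch["actions"])  # [[0.0, 1.0], None]
print(batch["answer"])   # [None, 'red']
```

Because both sample kinds share one typed container, a single collate path can serve a mixed robotics and VQA batch, which is the property the summary attributes to the TensorClass data model.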

Quick Start & Requirements

Environment setup uses uv: GIT_LFS_SKIP_SMUDGE=1 uv sync. Commands are provided for training (uv run torchrun ... scripts/train_policy.py), evaluation (uv run scripts/eval_policy.py), and serving (uv run scripts/serve_policy.py). The workflow relies on torchrun, wandb, and pyav. The README notes that RTX 5090 GPUs with CUDA 12.8 may hit compatibility issues with stable PyTorch and recommends PyTorch-Nightly in that case. Further details are in scripts/README.md and examples/libero.

Highlighted Details

  • Explicit TensorClass data model enables composable modules and heterogeneous dataset co-training.
  • Performance is tuned directly with native PyTorch features (FSDP2, gradient checkpointing) rather than through generic training libraries.
  • Supports multi-source dataset co-training (VQA, robotics) and multiple VLM backbones (Qwen3-VL, PaliGemma 1/2, SmolVLM).
  • Includes simulation-ready checkpoints and serving scripts.
  • Hydra workflow simplifies experimentation with easily overridable configurations.
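The shared override grammar mentioned for the Hydra workflow can be illustrated with a small standard-library sketch of dotted key=value overrides applied to a nested config. This is not Hydra's implementation and handles far less than Hydra does (for example, it does not parse true/false as booleans); the apply_overrides helper and the config keys (policy.lr, policy.backbone, batch_size) are hypothetical and not taken from the project.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        try:
            # Parses numbers, lists, and quoted strings.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else is kept as a plain string.
            value = raw_value
        node = result
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Hypothetical config; the same overrides could be passed to any entry point.
base = {"policy": {"lr": 3e-4, "backbone": "qwen3-vl"}, "batch_size": 32}
cfg = apply_overrides(base, ["policy.lr=1e-4", "batch_size=64"])
print(cfg["policy"]["lr"], cfg["batch_size"])
print(base["policy"]["lr"])  # the original config is left untouched
```

Keeping one override syntax for training, evaluation, and serving means a setting changed for a training run can be repeated verbatim when evaluating or serving the resulting checkpoint.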

Maintenance & Community

No specific details regarding contributors, community channels, or roadmaps are present in the provided README.

Licensing & Compatibility

The README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.

Limitations & Caveats

Users with RTX 5090 GPUs and CUDA 12.8 may face issues with stable PyTorch, requiring PyTorch-Nightly. The troubleshooting section is marked "To be Continued." The absence of a specified license is a significant adoption caveat.

Health Check

  • Last commit: 16 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 3
  • Issues (30d): 2
  • Star history: 249 stars in the last 30 days
