vla0 by NVlabs

State-of-the-art Vision-Language-Action models via text-based action representation

Created 1 month ago
312 stars

Top 86.3% on SourcePulse

Summary

VLA-0 offers a novel, simplified approach to building state-of-the-art Vision-Language-Action (VLA) models for robot manipulation. It targets researchers and engineers, enabling superior performance on benchmarks and real-world tasks without modifying the base Vision-Language Model (VLM) or requiring extensive robotics pretraining.

How It Works

VLA-0 explores representing robot actions directly as text, a largely unexplored strategy. It uses an existing VLM (Qwen2.5-VL-3B) as-is, with no architectural changes and no special action tokens: the model simply generates actions as ordinary text. This "zero modification" approach simplifies VLA development and, perhaps surprisingly, outperforms methods that alter the VLM vocabulary or add dedicated action heads, achieving state-of-the-art results on LIBERO and in real-world tests.
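The exact prompt format and discretization VLA-0 uses are not spelled out in this summary, so the sketch below is only a hedged illustration of the general idea: continuous actions are binned into integers and serialized as plain text for training, and the VLM's generated text is parsed back into continuous commands at inference time. The bin count, action bounds, and 7-dimensional chunk layout here are assumptions, not the project's actual scheme.

```python
# Illustrative sketch only: text-based action representation in the spirit of VLA-0.
# The bin count (1000), action bounds, and 7-DoF chunk layout are assumptions,
# not the repository's actual scheme.
import numpy as np

NUM_BINS = 1000                      # assumed discretization resolution
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action bounds


def actions_to_text(actions: np.ndarray) -> str:
    """Encode a (T, D) chunk of continuous actions as space-separated integers."""
    scaled = (actions - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    bins = np.clip(np.round(scaled * NUM_BINS), 0, NUM_BINS).astype(int)
    return "\n".join(" ".join(str(v) for v in row) for row in bins)


def text_to_actions(text: str, dims: int = 7) -> np.ndarray:
    """Decode the VLM's generated text back into a (T, D) array of actions."""
    values = [int(tok) for tok in text.split() if tok.lstrip("-").isdigit()]
    usable = (len(values) // dims) * dims            # drop any trailing partial row
    bins = np.array(values[:usable]).reshape(-1, dims)
    return bins / NUM_BINS * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW


if __name__ == "__main__":
    chunk = np.random.uniform(ACTION_LOW, ACTION_HIGH, size=(4, 7))
    text = actions_to_text(chunk)           # training target for the unmodified VLM
    recovered = text_to_actions(text)       # executed on the robot at inference time
    print(text.splitlines()[0])
    print(np.abs(recovered - chunk).max())  # quantization error, on the order of 1/NUM_BINS
```

Because both the inputs (image plus instruction) and the outputs (integer text) stay inside the VLM's existing vocabulary, no new tokens or action heads are required, which is exactly the "zero modification" property described above.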

Quick Start & Requirements

  • Installation: Clone with submodules (git clone --recurse-submodules), create and activate a conda env (conda create -n vla0 python=3.10, conda activate vla0), then install with extras (PIP_REQ_EXTRAS=qwen,libero pip install --no-build-isolation -e ".[qwen,libero]").
  • RoboVerse: The bundled RoboVerse library requires a separate install (cd libs/RoboVerse && PIP_REQ_EXTRAS=lerobot pip install --no-build-isolation -e ".[lerobot]" && cd ../..).
  • Prerequisites: Python 3.10, Conda, Qwen2.5-VL-3B (see the inference sketch after this list), LIBERO support, LeRobot datasets (v0.1 tested).
  • Links: Paper/Website: https://vla0.github.io/
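For context on the Qwen2.5-VL-3B prerequisite, below is a minimal, hypothetical sketch of querying the base VLM through Hugging Face transformers. It is not the repository's own inference script; the model ID, prompt wording, and generation settings are assumptions for illustration.

```python
# Hypothetical sketch of querying the base VLM (Qwen2.5-VL-3B) via Hugging Face
# transformers -- not the repository's own inference code. Model id, prompt
# wording, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed Hugging Face id of the base VLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("camera_frame.png")  # current robot camera observation
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Task: put the red block in the bowl. "
                                 "Reply with the next actions as space-separated integers."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Strip the prompt tokens and keep only the newly generated action text.
action_text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(action_text)  # e.g. "512 498 603 ..." -> parse with a decoder like text_to_actions above
```

In the actual project the VLM is fine-tuned so that its generated text follows the action format; this snippet only shows the plumbing around an unmodified base model.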

Highlighted Details

  • Best LIBERO performance without large-scale pretraining (94.7% success rate).
  • Outperforms models trained on extensive robotics data.
  • Superior real-world performance (+12.5% over SmolVLA on the SO-100 robot).
  • Requires no architectural changes to the base VLM.

Maintenance & Community

Developed by Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos (NVIDIA). Community contributions are welcome. Direct contact: ankgoyal@umich.edu. No dedicated community channels or roadmap links are provided.

Licensing & Compatibility

Code and models are released under CC BY-NC 4.0 (non-commercial use), and the base model is additionally subject to the Qwen Research License. Commercial adoption requires careful review of both licenses.

Limitations & Caveats

Inference currently runs at about 4 Hz; planned improvements include TensorRT-LLM integration (targeting 6 Hz) and lower-precision deployment (e.g., INT8) for additional speed. Compatibility with newer LeRobot versions is unvalidated, and the direct LeRobot integration could be simplified.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 13
  • Star History: 105 stars in the last 30 days
