vla0 by NVlabs

State-of-the-art Vision-Language-Action models via text-based action representation

Created 1 month ago
312 stars

Top 86.3% on SourcePulse

Summary

VLA-0 offers a novel, simplified approach to building state-of-the-art Vision-Language-Action (VLA) models for robot manipulation. It targets researchers and engineers, enabling superior performance on benchmarks and real-world tasks without modifying the base Vision-Language Model (VLM) or requiring extensive robotics pretraining.

How It Works

VLA-0 explores representing robot actions directly as text, a largely unexplored strategy. It uses an existing VLM (Qwen2.5-VL-3B) as-is, with no architectural changes and no special action tokens: the model simply generates actions as ordinary text. This "zero modification" approach simplifies VLA development and, perhaps surprisingly, outperforms methods that alter the VLM vocabulary or add dedicated action heads, achieving state-of-the-art results on LIBERO and in real-world tests.
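The exact prompt format and discretization VLA-0 uses are not spelled out in this summary, so the sketch below is only a hedged illustration of the general idea: continuous actions are binned into integers and serialized as plain text for training, and the VLM's generated text is parsed back into continuous commands at inference time. The bin count, action bounds, and 7-dimensional chunk layout here are assumptions, not the project's actual scheme.

```python
# Illustrative sketch only: text-based action representation in the spirit of VLA-0.
# The bin count (1000), action bounds, and 7-DoF chunk layout are assumptions,
# not the repository's actual scheme.
import numpy as np

NUM_BINS = 1000                      # assumed discretization resolution
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action bounds


def actions_to_text(actions: np.ndarray) -> str:
    """Encode a (T, D) chunk of continuous actions as space-separated integers."""
    scaled = (actions - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    bins = np.clip(np.round(scaled * NUM_BINS), 0, NUM_BINS).astype(int)
    return "\n".join(" ".join(str(v) for v in row) for row in bins)


def text_to_actions(text: str, dims: int = 7) -> np.ndarray:
    """Decode the VLM's generated text back into a (T, D) array of actions."""
    values = [int(tok) for tok in text.split() if tok.lstrip("-").isdigit()]
    usable = (len(values) // dims) * dims            # drop any trailing partial row
    bins = np.array(values[:usable]).reshape(-1, dims)
    return bins / NUM_BINS * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW


if __name__ == "__main__":
    chunk = np.random.uniform(ACTION_LOW, ACTION_HIGH, size=(4, 7))
    text = actions_to_text(chunk)           # training target for the unmodified VLM
    recovered = text_to_actions(text)       # executed on the robot at inference time
    print(text.splitlines()[0])
    print(np.abs(recovered - chunk).max())  # quantization error, on the order of 1/NUM_BINS
```

Because both the inputs (image plus instruction) and the outputs (integer text) stay inside the VLM's existing vocabulary, no new tokens or action heads are required, which is exactly the "zero modification" property described above.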

Quick Start & Requirements

  • Installation: Clone with submodules (git clone --recurse-submodules), create and activate a conda env (conda create -n vla0 python=3.10, conda activate vla0), then install with extras (PIP_REQ_EXTRAS=qwen,libero pip install --no-build-isolation -e ".[qwen,libero]").
  • RoboVerse: The bundled RoboVerse library requires a separate install (cd libs/RoboVerse && PIP_REQ_EXTRAS=lerobot pip install --no-build-isolation -e ".[lerobot]" && cd ../..).
  • Prerequisites: Python 3.10, Conda, Qwen2.5-VL-3B (see the inference sketch after this list), LIBERO support, LeRobot datasets (v0.1 tested).
  • Links: Paper/Website: https://vla0.github.io/
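For context on the Qwen2.5-VL-3B prerequisite, below is a minimal, hypothetical sketch of querying the base VLM through Hugging Face transformers. It is not the repository's own inference script; the model ID, prompt wording, and generation settings are assumptions for illustration.

```python
# Hypothetical sketch of querying the base VLM (Qwen2.5-VL-3B) via Hugging Face
# transformers -- not the repository's own inference code. Model id, prompt
# wording, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed Hugging Face id of the base VLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("camera_frame.png")  # current robot camera observation
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Task: put the red block in the bowl. "
                                 "Reply with the next actions as space-separated integers."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Strip the prompt tokens and keep only the newly generated action text.
action_text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(action_text)  # e.g. "512 498 603 ..." -> parse with a decoder like text_to_actions above
```

In the actual project the VLM is fine-tuned so that its generated text follows the action format; this snippet only shows the plumbing around an unmodified base model.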

Highlighted Details

  • Best LIBERO performance without large-scale pretraining (94.7% success rate).
  • Outperforms models trained on extensive robotics data.
  • Superior real-world performance (+12.5% over SmolVLA on the SO-100 robot).
  • Requires no architectural changes to the base VLM.

Maintenance & Community

Developed by Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos (NVIDIA). Community contributions are welcome. Direct contact: ankgoyal@umich.edu. No dedicated community channels or roadmap links are provided.

Licensing & Compatibility

Code and models are released under CC BY-NC 4.0 (non-commercial use), and the base model is additionally subject to the Qwen Research License. Commercial adoption requires careful review of both licenses.

Limitations & Caveats

Inference currently runs at about 4 Hz; planned improvements include TensorRT-LLM integration (targeting 6 Hz) and lower-precision deployment (e.g., INT8) for additional speed. Compatibility with newer LeRobot versions is unvalidated, and the direct LeRobot integration could be simplified.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 13
  • Star History: 105 stars in the last 30 days
