UniVLA is a unified vision-language-action framework designed for learning generalist robotic policies across diverse environments and embodiments. It targets researchers and engineers in robotics and AI who aim to develop adaptable and efficient control systems, offering significant improvements over previous methods like OpenVLA.
How It Works
UniVLA introduces task-centric latent actions, learned without supervision via a VQ-VAE, to create an embodiment-agnostic action space. This lets the model leverage data from diverse sources, including video without explicit action labels. A generalist policy is pretrained on this latent action space; lightweight, embodiment-specific action decoders are then attached for deployment, enabling efficient fine-tuning and adaptation.
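The core operation of a VQ-VAE is snapping a continuous encoding to the nearest entry in a learned codebook, yielding a discrete latent action token. A minimal sketch of that quantization step, with illustrative function and variable names (not UniVLA's actual API):

```python
import numpy as np

def quantize_latent_action(encoding, codebook):
    """Map a continuous latent to its nearest codebook entry (VQ step).

    `encoding` (shape [d]) and `codebook` (shape [K, d]) are illustrative
    names; the released code defines its own latent-action model.
    """
    # Squared L2 distance from the encoding to every codebook vector.
    dists = np.sum((codebook - encoding) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy example: a 4-entry codebook of 3-dim latent actions.
codebook = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
idx, token = quantize_latent_action(np.array([0.9, 0.1, 0.0]), codebook)
```

Because downstream policy learning operates on the discrete index `idx` rather than raw robot commands, the same codebook can be shared across embodiments.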
Quick Start & Requirements
- Install: Clone the repository and install dependencies with `pip install -e .`.
- Prerequisites: Python 3.10, PyTorch 2.2.0 with CUDA 12.1, Flash Attention 2.
- Setup: Creating a dedicated Conda environment before installing PyTorch is recommended.
- Docs: Paper, Demo Page (Coming Soon).
Highlighted Details
- Achieves state-of-the-art performance on LIBERO benchmarks, outperforming models like Diffusion Policy, Octo, OpenVLA, and TraceVLA.
- Demonstrates significant computational efficiency, requiring only 5% of the resources used by OpenVLA for full-scale pretraining.
- Offers cost-efficient pre-training options for specific datasets (e.g., BridgeV2, Ego4D human videos).
- Supports real-world deployment with lightweight action decoders (roughly 12M parameters) and parameter-efficient fine-tuning via LoRA.
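LoRA, the parameter-efficient fine-tuning method mentioned above, freezes the pretrained weight matrix and trains only a low-rank additive update. A minimal sketch of a LoRA forward pass, with illustrative names and shapes not taken from the UniVLA codebase:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass with a LoRA adapter: y = x @ (W + alpha * A @ B).

    W (d_in x d_out) is the frozen pretrained weight; only the low-rank
    factors A (d_in x r) and B (r x d_out) are trained.
    """
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2              # rank r << d keeps trainable params small
W = rng.standard_normal((d_in, d_out))
A = rng.standard_normal((d_in, r)) * 0.01
B = np.zeros((r, d_out))              # B starts at zero, so the adapter is a no-op
x = rng.standard_normal((1, d_in))
y = lora_forward(x, W, A, B)          # initially identical to the frozen base model
```

Initializing `B` to zero means fine-tuning starts exactly from the pretrained policy, which is the standard LoRA design choice.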
Maintenance & Community
- Official implementation of a paper published at RSS 2025.
- Primary contact: Qingwen Bu (buqingwen@opendrivelab.com).
- Code released in May 2025.
Licensing & Compatibility
- The repository is released under the MIT License, permitting commercial use and closed-source linking.
Limitations & Caveats
- Demo page and some specific fine-tuning scripts (e.g., Room2Room, CALVIN, SimplerEnv) are marked as "Coming Soon" or "TODO".
- Real-world deployment guidelines are based on the AgiLex platform, requiring adaptation for other systems.