SimpleVLA-RL by PRIME-RL

Online RL for VLA models with minimal data

Created 3 months ago
658 stars

Top 50.8% on SourcePulse

Project Summary

This repository introduces SimpleVLA-RL, an approach for online Reinforcement Learning (RL) of Vision-Language-Action (VLA) models. It enables effective training with minimal data using only simple 0/1 outcome-level rewards, making VLA training more data-efficient while achieving performance comparable to full-trajectory Supervised Fine-Tuning (SFT). The target audience is researchers and practitioners working with VLA models for robotics and embodied AI.

How It Works

SimpleVLA-RL leverages outcome-level 0/1 reward signals directly from simulation environments. This approach simplifies reward engineering and significantly reduces the need for extensive, high-quality trajectory data. By using only one trajectory per task for initial SFT, it demonstrates that simple rewards can drive effective online RL, leading to substantial performance gains over baseline SFT models.
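The outcome-level reward described above can be sketched in a few lines. This is a minimal illustration, not SimpleVLA-RL's actual API; the function names and the success flag are assumptions for the sake of the example:

```python
def outcome_reward(episode_success: bool) -> float:
    """Binary 0/1 outcome reward: 1.0 if the simulated task
    succeeded, 0.0 otherwise. No per-step reward shaping."""
    return 1.0 if episode_success else 0.0


def assign_returns(trajectory_len: int, episode_success: bool) -> list[float]:
    """Broadcast the single trajectory-level outcome reward to every
    step of the rollout, as an outcome-only RL setup would."""
    r = outcome_reward(episode_success)
    return [r] * trajectory_len
```

The point of the sketch is that the reward signal requires no hand-engineered shaping terms: the simulator's task-success flag is the entire reward specification.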

Quick Start & Requirements

  • Installation: Requires setting up the veRL environment and the OpenVLA-OFT model, following their respective guides.
  • Prerequisites: Weights & Biases (WandB) API key, a Python environment, and specific NVIDIA drivers (tested with 470.161.03) and CUDA (tested with 12.4).
  • Hardware: Tested on single-node (8x A800 80GB GPUs) and multi-node (16x A800 80GB GPUs) setups.
  • Resources: Requires downloading SFT models (e.g., libero-10 traj1 SFT).
  • Training Command: bash examples/run_openvla_oft_rl.sh
  • Documentation: veRL, OpenVLA-OFT

Highlighted Details

  • Achieves 97.6 points on LIBERO-Long using OpenVLA-OFT.
  • Improves OpenVLA-OFT performance from 17.3 to 91.7 (430.1% increase) with only one trajectory per task for SFT.
  • Supports LIBERO benchmark; RoboTwin benchmark is planned.
  • Models available on Hugging Face: SimpleVLA-RL Collection.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying projects (veRL, OpenVLA-OFT) should be consulted for licensing details.

Limitations & Caveats

  • The README notes that the repository's openvla-oft model design differs from the official OpenVLA-OFT implementation.
  • Support for additional benchmarks (e.g., RoboTwin) and tokenizers (Pi0 fast tokenizer) is listed as a TODO.
Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 299 stars in the last 30 days
