VLABench by OpenMOSS

Benchmark for robotics manipulation and embodied agents

created 7 months ago
274 stars

Top 94.3% on SourcePulse

Project Summary

VLABench is a large-scale benchmark suite for evaluating Vision-Language-Action (VLA) models and Vision-Language Models (VLMs) on robotics manipulation tasks. It targets researchers and engineers working on embodied AI and language-conditioned robotics, providing a standardized framework for assessing long-horizon reasoning and generalization.

How It Works

VLABench uses a modular framework for task construction, making tasks easy to adapt and extend. It offers standardized benchmark datasets across several evaluation dimensions: in-distribution performance, cross-category generalization, common-sense reasoning, semantic instruction following, cross-task transfer, and visual robustness to texture variations. The evaluation framework is designed to ensure fair comparisons across different models and machines.
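The evaluation dimensions above can be pictured as a set of tracks that a policy is scored on independently. The sketch below is purely illustrative: every identifier in it (`EvalTrack`, `TrackResult`, `run_benchmark`) is hypothetical and not part of the actual VLABench API; only the track names mirror the dimensions listed above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, List

# Hypothetical track names mirroring VLABench's stated evaluation
# dimensions; these identifiers are NOT from the VLABench codebase.
class EvalTrack(Enum):
    IN_DISTRIBUTION = auto()
    CROSS_CATEGORY = auto()
    COMMON_SENSE = auto()
    SEMANTIC_INSTRUCTION = auto()
    CROSS_TASK = auto()
    TEXTURE_ROBUSTNESS = auto()

@dataclass
class TrackResult:
    track: EvalTrack
    success_rate: float  # fraction of episodes scored as success

def run_benchmark(policy: Callable[[str], float],
                  tracks: List[EvalTrack]) -> Dict[EvalTrack, "TrackResult"]:
    """Score the policy on each track and collect per-track success rates."""
    results = {}
    for track in tracks:
        # A real harness would roll out episodes in simulation per track;
        # here the policy is a stand-in mapping a track name to a score.
        score = policy(track.name)
        results[track] = TrackResult(track=track, success_rate=score)
    return results
```

Keeping each generalization dimension as a separate track is what allows fair comparison: every model is scored on the same fixed set of axes rather than a single aggregate number.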

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n vlabench python=3.10), activate it, install requirements (pip install -r requirements.txt), and install VLABench locally (pip install -e .).
  • Assets: Download necessary assets using python scripts/download_assets.py.
  • Submodules: Initialize submodules with git submodule update --init --recursive.
  • Prerequisites: Python 3.10, Conda.
  • Resources: Data collection can be parallelized. Evaluation of each task can take 30 minutes to 1 hour per process.
  • Links: Paper, Project Website, Hugging Face.
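Put together, the installation steps above amount to the following sequence. The repository URL is an assumption (the summary does not state it); check the project page for the canonical one.

```shell
# Assumed repository URL -- verify against the project page.
git clone https://github.com/OpenMOSS/VLABench.git
cd VLABench
git submodule update --init --recursive

# Create and activate the environment, then install dependencies
conda create -n vlabench python=3.10
conda activate vlabench
pip install -r requirements.txt
pip install -e .

# Download the benchmark assets
python scripts/download_assets.py
```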

Highlighted Details

  • Supports multiple evaluation tracks focusing on different generalization capabilities.
  • Provides standardized fine-tuning datasets for primitive tasks.
  • Offers scripts for converting data to RLDS and Libero formats.
  • Includes evaluation pipelines for popular VLA models such as OpenVLA and Open-Pi, and for VLMs such as GPT-4V, Qwen2-VL, and LLaVA.
  • Supports multi-GPU accelerated evaluation for faster benchmarking.

Maintenance & Community

The project is actively maintained, with recent updates including parallel data collection, camera augmentation, and the release of finetuned checkpoints. The authors encourage community contributions via issues and pull requests and plan to release a comprehensive infra framework, including training pipelines and a leaderboard.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Functionality in the preview version is still being tested. The current data collection scripts do not support multi-processing within the code, though parallelization is planned. Conversion to RLDS format is time-consuming with a single process, and the upstream conversion code may contain bugs.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 16
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 16 more.

open-r1 by huggingface — SDK for reproducing DeepSeek-R1 (top 0.3% on SourcePulse, 25k stars, created 6 months ago, updated 5 days ago)