vla-evaluation-harness by allenai

A unified framework for evaluating Vision-Language-Action (VLA) models across robot simulation benchmarks.

Created 1 month ago
253 stars

Top 99.3% on SourcePulse

Project Summary

This framework standardizes the evaluation of Vision-Language-Action (VLA) models across diverse robot simulation benchmarks. It offers researchers and engineers a unified, reproducible, and highly efficient system, eliminating the common pain points of disparate dependencies and evaluation protocols, thereby accelerating VLA model development and comparison.

How It Works

The core design employs an abstraction layer that decouples VLA models from specific benchmarks. Benchmarks are containerized in Docker images, ensuring exact reproducibility and eliminating dependency conflicts. Model servers are deployed as self-contained uv scripts with inline dependency declarations, so they run with no manual environment setup. This architecture enables a comprehensive cross-evaluation matrix: any model can be tested against multiple benchmarks with minimal integration effort.
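The "self-contained uv script" pattern refers to PEP 723 inline script metadata, which uv reads to build the script's environment on the fly. The sketch below is illustrative only: the function name, action shape, and dependencies are assumptions, not the project's actual server API.

```python
# Hypothetical single-file model server in the PEP 723 inline-metadata style
# that uv executes directly (uv run server.py). All names are illustrative.
# /// script
# requires-python = ">=3.11"
# dependencies = ["numpy"]
# ///
import numpy as np

def predict_action(image: np.ndarray, instruction: str) -> np.ndarray:
    """Dummy policy: returns a zero 7-DoF action for any observation."""
    return np.zeros(7, dtype=np.float32)

if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.uint8)
    action = predict_action(obs, "pick up the block")
    print(action.shape)  # (7,)
```

Because the dependency list travels with the file, `uv run` can resolve and cache the environment without a separate install step, which is what makes the servers portable across machines.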

Quick Start & Requirements

Installation is straightforward via pip install vla-eval or from source using uv sync --python 3.11 --all-extras --dev. Key requirements include Python 3.11+, Docker, and a GPU for efficient model serving. A quick start involves running a model server in one terminal and the evaluation client in another. Detailed documentation is available for architecture, contribution, and reproduction reports.
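The two-terminal workflow implies a client-server split: the evaluation client sends observations to the model server and receives actions back. The sketch below illustrates that loop with a stub in-process server; the `/act` endpoint, JSON payload shape, and 7-element action are assumptions, not the project's documented protocol.

```python
# Illustrative client/server round trip for a VLA evaluation step.
# A stub HTTP server stands in for the real model server; the endpoint
# and payload schema here are assumptions, not the project's actual API.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubPolicy(BaseHTTPRequestHandler):
    """Stand-in model server: answers every observation with a zero action."""
    def do_POST(self):
        _obs = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"action": [0.0] * 7}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep test output quiet
        pass

def get_action(server_url: str, observation: dict) -> list:
    """One evaluation-client step: POST an observation, return the action."""
    req = urllib.request.Request(
        server_url + "/act",
        data=json.dumps(observation).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["action"]

server = HTTPServer(("127.0.0.1", 0), StubPolicy)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}"
action = get_action(url, {"instruction": "pick up the block"})
server.shutdown()
print(action)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Keeping the model behind an HTTP boundary is what lets the Dockerized benchmark and the GPU-hosted model run in separate environments without sharing dependencies.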

Highlighted Details

  • Batch Parallel Evaluation: Achieves a 47x speedup, processing 2,000 LIBERO episodes in approximately 18 minutes on a single H100 GPU, through episode sharding and batched GPU inference.
  • Zero Setup: Eliminates dependency hell by packaging benchmarks in Docker and model servers as single-file uv scripts.
  • AI-Assisted Integration: Leverages built-in Claude Code skills to scaffold new benchmark and model integrations rapidly.
  • Comprehensive Leaderboard: Hosts the largest unified VLA comparison, aggregating over 500 models across 17 benchmarks from more than 1,700 papers.
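The 47x speedup claim rests on episode sharding plus batched GPU inference. The summary does not show the harness's internals, so the sketch below is a generic round-robin sharding scheme under assumed names (`shard_episodes`, the shard count), not the project's actual implementation.

```python
# Illustrative episode sharding: distribute 2,000 episodes across parallel
# batch slots so every slot stays busy during batched inference.
# All names and the shard count are assumptions, not the harness's API.
from typing import List

def shard_episodes(episode_ids: List[int], num_shards: int) -> List[List[int]]:
    """Round-robin episodes into shards of near-equal size."""
    shards: List[List[int]] = [[] for _ in range(num_shards)]
    for i, ep in enumerate(episode_ids):
        shards[i % num_shards].append(ep)
    return shards

shards = shard_episodes(list(range(2000)), num_shards=64)
print(len(shards), len(shards[0]))  # 64 32
```

With episodes sharded this way, each inference call can stack one observation per shard into a single batch, which is where the GPU-side speedup comes from.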

Maintenance & Community

The project cites a 2026 arXiv preprint, indicating recent development activity. While specific community channels (like Discord/Slack) or prominent maintainer details are not explicitly listed in the README, the contribution guidelines suggest an open process for adding support for new benchmarks and models.

Licensing & Compatibility

The project is released under the permissive Apache 2.0 license, generally compatible with commercial use and closed-source integration without significant copyleft concerns.

Limitations & Caveats

The README does not detail specific limitations, alpha status, or known bugs. However, it actively solicits contributions for expanding benchmark and model support, suggesting that the integration matrix is still evolving. The AI-assisted integration path relies on Claude Code, which ties that workflow to a proprietary external tool.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 25
  • Issues (30d): 6
  • Star History: 100 stars in the last 30 days
