vla-evaluation-harness by allenai

A unified framework for evaluating Vision-Language-Action (VLA) models across robot simulation benchmarks.

Created 1 month ago
253 stars

Top 99.3% on SourcePulse

Project Summary

This framework standardizes the evaluation of Vision-Language-Action (VLA) models across diverse robot simulation benchmarks. It offers researchers and engineers a unified, reproducible, and highly efficient system, eliminating the common pain points of disparate dependencies and evaluation protocols, thereby accelerating VLA model development and comparison.

How It Works

The core design employs an abstraction layer that decouples VLA models from specific benchmarks. Benchmarks are containerized in Docker images, ensuring exact reproducibility and eliminating dependency conflicts. Model servers are deployed as self-contained uv scripts with inline dependency declarations, so they run with no manual environment setup. This architecture enables a comprehensive cross-evaluation matrix: any model can be tested against multiple benchmarks with minimal integration effort.
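The "self-contained uv script" pattern refers to PEP 723 inline script metadata, which uv reads to build the script's environment on the fly. The sketch below is illustrative only: the function name, action shape, and dependencies are assumptions, not the project's actual server API.

```python
# Hypothetical single-file model server in the PEP 723 inline-metadata style
# that uv executes directly (uv run server.py). All names are illustrative.
# /// script
# requires-python = ">=3.11"
# dependencies = ["numpy"]
# ///
import numpy as np

def predict_action(image: np.ndarray, instruction: str) -> np.ndarray:
    """Dummy policy: returns a zero 7-DoF action for any observation."""
    return np.zeros(7, dtype=np.float32)

if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.uint8)
    action = predict_action(obs, "pick up the block")
    print(action.shape)  # (7,)
```

Because the dependency list travels with the file, `uv run` can resolve and cache the environment without a separate install step, which is what makes the servers portable across machines.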

Quick Start & Requirements

Installation is straightforward via pip install vla-eval or from source using uv sync --python 3.11 --all-extras --dev. Key requirements include Python 3.11+, Docker, and a GPU for efficient model serving. A quick start involves running a model server in one terminal and the evaluation client in another. Detailed documentation is available for architecture, contribution, and reproduction reports.
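The two-terminal workflow implies a client-server split: the evaluation client sends observations to the model server and receives actions back. The sketch below illustrates that loop with a stub in-process server; the `/act` endpoint, JSON payload shape, and 7-element action are assumptions, not the project's documented protocol.

```python
# Illustrative client/server round trip for a VLA evaluation step.
# A stub HTTP server stands in for the real model server; the endpoint
# and payload schema here are assumptions, not the project's actual API.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubPolicy(BaseHTTPRequestHandler):
    """Stand-in model server: answers every observation with a zero action."""
    def do_POST(self):
        _obs = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"action": [0.0] * 7}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep test output quiet
        pass

def get_action(server_url: str, observation: dict) -> list:
    """One evaluation-client step: POST an observation, return the action."""
    req = urllib.request.Request(
        server_url + "/act",
        data=json.dumps(observation).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["action"]

server = HTTPServer(("127.0.0.1", 0), StubPolicy)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}"
action = get_action(url, {"instruction": "pick up the block"})
server.shutdown()
print(action)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Keeping the model behind an HTTP boundary is what lets the Dockerized benchmark and the GPU-hosted model run in separate environments without sharing dependencies.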

Highlighted Details

  • Batch Parallel Evaluation: Achieves a 47x speedup, processing 2,000 LIBERO episodes in approximately 18 minutes on a single H100 GPU, through episode sharding and batched GPU inference.
  • Zero Setup: Eliminates dependency hell by packaging benchmarks in Docker and model servers as single-file uv scripts.
  • AI-Assisted Integration: Leverages built-in Claude Code skills to scaffold new benchmark and model integrations rapidly.
  • Comprehensive Leaderboard: Hosts the largest unified VLA comparison, aggregating over 500 models across 17 benchmarks from more than 1,700 papers.
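The 47x speedup claim rests on episode sharding plus batched GPU inference. The summary does not show the harness's internals, so the sketch below is a generic round-robin sharding scheme under assumed names (`shard_episodes`, the shard count), not the project's actual implementation.

```python
# Illustrative episode sharding: distribute 2,000 episodes across parallel
# batch slots so every slot stays busy during batched inference.
# All names and the shard count are assumptions, not the harness's API.
from typing import List

def shard_episodes(episode_ids: List[int], num_shards: int) -> List[List[int]]:
    """Round-robin episodes into shards of near-equal size."""
    shards: List[List[int]] = [[] for _ in range(num_shards)]
    for i, ep in enumerate(episode_ids):
        shards[i % num_shards].append(ep)
    return shards

shards = shard_episodes(list(range(2000)), num_shards=64)
print(len(shards), len(shards[0]))  # 64 32
```

With episodes sharded this way, each inference call can stack one observation per shard into a single batch, which is where the GPU-side speedup comes from.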

Maintenance & Community

The project cites a 2026 arXiv preprint, indicating recent development activity. While specific community channels (like Discord/Slack) or prominent maintainer details are not explicitly listed in the README, the contribution guidelines suggest an open process for adding support for new benchmarks and models.

Licensing & Compatibility

The project is released under the permissive Apache 2.0 license, generally compatible with commercial use and closed-source integration without significant copyleft concerns.

Limitations & Caveats

The README does not detail specific limitations, alpha status, or known bugs. However, it actively solicits contributions for expanding benchmark and model support, suggesting that the integration matrix is still evolving. The AI-assisted integration path relies on Claude Code, which ties that workflow to a proprietary external tool.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 25
  • Issues (30d): 6
  • Star History: 100 stars in the last 30 days
