ERQA by embodiedreasoning

Embodied reasoning benchmark for multimodal QA

Created 10 months ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
Project Summary

Summary: The Embodied Reasoning Question Answer (ERQA) benchmark evaluates multimodal embodied reasoning, a capability central to robotics and AI agents operating in real-world environments. It gives researchers and engineers a curated dataset of complex, multiple-choice questions that interleave images and text, probing spatial reasoning, common-sense world knowledge, and the interpretation of visual-linguistic cues in realistic scenarios, and it offers a standardized way to assess a model's understanding of embodied contexts.

How It Works: ERQA is structured around a dataset of multimodal questions stored in TFRecord format. Each example includes encoded images, the question text, a ground-truth answer (a single letter choice), an optional question type, and visual indices that control where each image appears relative to the text. The project provides a lightweight evaluation harness for querying external multimodal APIs, specifically Google's Gemini 2.0 and OpenAI's models, so users can benchmark large language and vision-language models against the ERQA dataset without implementing inference pipelines from scratch.
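To make that record layout concrete, here is a minimal parsing sketch. The feature names ("images", "question", "answer", "question_type", "visual_indices") and the file path are assumptions for illustration, not the benchmark's documented schema, so check the repository's loading code for the exact keys.

    # Hypothetical sketch of reading ERQA-style TFRecord examples.
    # Feature names below are illustrative guesses, not the official schema.
    import tensorflow as tf

    feature_spec = {
        "images": tf.io.VarLenFeature(tf.string),          # one or more encoded images
        "question": tf.io.FixedLenFeature([], tf.string),  # question text
        "answer": tf.io.FixedLenFeature([], tf.string),    # single-letter ground truth
        "question_type": tf.io.VarLenFeature(tf.string),   # optional category label
        "visual_indices": tf.io.VarLenFeature(tf.int64),   # image placement in the text
    }

    def parse_example(serialized):
        example = tf.io.parse_single_example(serialized, feature_spec)
        # Variable-length features come back as SparseTensors; densify for convenience.
        example["images"] = tf.sparse.to_dense(example["images"])
        example["visual_indices"] = tf.sparse.to_dense(example["visual_indices"])
        return example

    dataset = tf.data.TFRecordDataset("erqa.tfrecord")  # placeholder path
    for record in dataset.take(1):
        ex = parse_example(record)
        print(ex["question"].numpy().decode(), ex["answer"].numpy().decode())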

Quick Start & Requirements:

  • Installation: After activating a Python virtual environment, install necessary packages using pip install -r requirements.txt.
  • Prerequisites: A functional Python environment (version not specified) and API access keys for either Google Gemini or OpenAI.
  • API Key Configuration: Keys can be provided via environment variables (GEMINI_API_KEY, OPENAI_API_KEY), directly as command-line arguments (--gemini_api_key, --openai_api_key), or listed in a text file (--api_keys_file). The keys file format expects one key per line, optionally prefixed with the API type (e.g., gemini:YOUR_KEY or openai:YOUR_KEY).
  • Running Evaluation: Initiate evaluation with python eval_harness.py. By default the harness uses the Gemini API. Users can specify models (e.g., --model gemini-2.0-flash-exp, or --api openai --model gpt-4o-2024-11-20), the number of examples to process (--num_examples), and other parameters; a combined example follows this list.
  • Documentation: Further details, visualizations, and the tech report are referenced but not directly linked in the provided snippet.
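
Putting the steps together, a minimal session might look like the following. The keys-file name api_keys.txt and the example count of 50 are illustrative choices; only the flags listed above are used.

    # api_keys.txt -- one key per line, optionally prefixed with the API type:
    #   gemini:YOUR_GEMINI_KEY
    #   openai:YOUR_OPENAI_KEY

    pip install -r requirements.txt
    python eval_harness.py                                  # default: Gemini API
    python eval_harness.py --api openai --model gpt-4o-2024-11-20 \
        --num_examples 50 --api_keys_file api_keys.txt      # 50 examples via OpenAI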

Highlighted Details:

  • The ERQA benchmark comprises 400 multimodal examples designed to test embodied reasoning.
  • The evaluation harness supports querying multiple state-of-the-art models, including various Gemini 2.0 configurations (e.g., gemini-2.0-flash-exp, gemini-2.0-pro) and OpenAI models (e.g., gpt-4o-2024-11-20).
  • The harness includes robust retry logic for API calls, automatically handling resource exhaustion (HTTP 429) and connection errors with configurable retry counts and backoff periods, and it can use multiple API keys sequentially to improve success rates; a sketch of this pattern follows the list.
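
As an illustration of that retry pattern (not the harness's actual code; the function and the call_api callable are hypothetical), a minimal sketch might look like this:

    # Illustrative sketch of retry-with-backoff across multiple API keys.
    # `call_api` is a hypothetical callable standing in for a Gemini/OpenAI request.
    import time

    def query_with_retries(call_api, prompt, api_keys, max_retries=5, backoff_s=2.0):
        """Try each key in turn, backing off on rate limits and connection errors."""
        for key in api_keys:
            delay = backoff_s
            for _ in range(max_retries):
                try:
                    return call_api(prompt, api_key=key)
                except ConnectionError:
                    time.sleep(delay)            # transient network failure: wait and retry
                    delay *= 2                   # exponential backoff
                except RuntimeError as err:
                    if "429" in str(err):        # resource exhausted: back off, then retry
                        time.sleep(delay)
                        delay *= 2
                    else:
                        raise
        raise RuntimeError("All API keys exhausted without a successful response.")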

Maintenance & Community: The provided README snippet does not contain information regarding project maintainers, community support channels (e.g., Discord, Slack), contribution guidelines, or a public roadmap.

Licensing & Compatibility: Specific licensing information (e.g., MIT, Apache, GPL) and any associated restrictions for commercial use or integration with closed-source systems are not detailed in the provided text.

Limitations & Caveats: Running the benchmark requires obtaining and configuring API keys for external cloud-based AI services (Gemini, OpenAI), which may involve costs and usage limits. The evaluation harness is tailored to these specific APIs; extending it to other models or custom local deployments would require significant modification. Some of the referenced Gemini models are designated experimental, and the benchmark focuses specifically on embodied reasoning tasks rather than the full range of AI agent capabilities.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
