embodiedreasoning: Embodied reasoning benchmark for multimodal QA
Summary: The Embodied Reasoning Question Answer (ERQA) benchmark addresses the critical need for evaluating multimodal embodied reasoning capabilities, particularly relevant for advancing robotics and AI agents operating in real-world environments. It targets researchers and engineers by providing a curated dataset of complex, multiple-choice questions that integrate interleaved images and text. These questions probe spatial reasoning, common-sense world knowledge, and the ability to interpret visual-linguistic cues within realistic scenarios, offering a standardized method to assess AI's understanding of embodied contexts.
How It Works: ERQA is structured around a dataset of multimodal questions stored in TFRecord format. Each example includes encoded images, the question text, a ground-truth answer (a single letter choice), an optional question type, and visual indices that control where images are placed relative to the text. The project provides a lightweight evaluation harness for querying external multimodal APIs, specifically Google's Gemini 2.0 and OpenAI models. This design lets users benchmark large language and vision-language models against the ERQA dataset without implementing inference pipelines from scratch.
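As a concrete illustration of this record layout, below is a minimal sketch of parsing one example with TensorFlow. The feature keys, dtypes, and the file name erqa.tfrecord are assumptions inferred from the description above; the source does not spell out the exact schema, so the real field names may differ.

```python
# Minimal sketch of reading ERQA-style TFRecord examples.
# Feature names below are assumptions (encoded images, question text,
# answer letter, optional question type, visual indices).
import tensorflow as tf

FEATURES = {
    "images/encoded": tf.io.VarLenFeature(tf.string),   # one or more encoded images (assumed key)
    "question": tf.io.FixedLenFeature([], tf.string),   # question text with answer choices (assumed key)
    "answer": tf.io.FixedLenFeature([], tf.string),     # ground-truth letter, e.g. b"B" (assumed key)
    "question_type": tf.io.FixedLenFeature([], tf.string, default_value=b""),  # optional category
    "visual_indices": tf.io.VarLenFeature(tf.int64),    # where images interleave with text (assumed key)
}

def parse_example(serialized):
    """Decode one serialized example into a dict of dense tensors."""
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    parsed["images/encoded"] = tf.sparse.to_dense(parsed["images/encoded"])
    parsed["visual_indices"] = tf.sparse.to_dense(parsed["visual_indices"])
    return parsed

dataset = tf.data.TFRecordDataset("erqa.tfrecord").map(parse_example)
for ex in dataset.take(1):
    print(ex["question"].numpy().decode(), "->", ex["answer"].numpy().decode())
```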
Quick Start & Requirements:
- Install dependencies: pip install -r requirements.txt.
- Provide API keys via environment variables (GEMINI_API_KEY, OPENAI_API_KEY), directly as command-line arguments (--gemini_api_key, --openai_api_key), or in a text file (--api_keys_file). The keys file expects one key per line, optionally prefixed with the API type (e.g., gemini:YOUR_KEY or openai:YOUR_KEY); see the parsing sketch after this list.
- Run the harness: python eval_harness.py. Default execution uses the Gemini API. Users can specify the model (e.g., --model gemini-2.0-flash-exp, or --api openai --model gpt-4o-2024-11-20), the number of examples to process (--num_examples), and other parameters.
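For reference, here is a minimal sketch of how a keys file in the format described above might be parsed. The helper name and the handling of blank and comment lines are assumptions for illustration; the harness's actual parsing may differ.

```python
# Illustrative parser for an --api_keys_file: one key per line,
# optionally prefixed with "gemini:" or "openai:".
from collections import defaultdict

def load_api_keys(path: str) -> dict[str, list[str]]:
    """Return keys grouped by API type; unprefixed keys go under 'default'."""
    keys = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments (assumed convention)
            if ":" in line:
                api, _, key = line.partition(":")
                keys[api.strip().lower()].append(key.strip())
            else:
                keys["default"].append(line)
    return dict(keys)

# Example: keys = load_api_keys("api_keys.txt"); gemini_keys = keys.get("gemini", [])
```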
Highlighted Details:
- Supports Gemini models (e.g., gemini-2.0-flash-exp, gemini-2.0-pro) and OpenAI models (e.g., gpt-4o-2024-11-20); a minimal query sketch for both APIs follows the maintenance note below.
Maintenance & Community: The provided README snippet does not contain information regarding project maintainers, community support channels (e.g., Discord, Slack), contribution guidelines, or a public roadmap.
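To illustrate the dual-API support highlighted above, the sketch below sends one question and one image to either service. The function names and prompt/image handling are simplified assumptions; the google-generativeai and openai calls reflect those client libraries' public APIs, not the harness's internal code.

```python
# Minimal sketch of sending one multimodal question to Gemini or OpenAI.
import base64

def ask_gemini(api_key: str, question: str, image_bytes: bytes,
               model: str = "gemini-2.0-flash-exp") -> str:
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    # Images and text can be interleaved in the contents list.
    response = genai.GenerativeModel(model).generate_content(
        [{"mime_type": "image/jpeg", "data": image_bytes}, question]
    )
    return response.text

def ask_openai(api_key: str, question: str, image_bytes: bytes,
               model: str = "gpt-4o-2024-11-20") -> str:
    from openai import OpenAI
    client = OpenAI(api_key=api_key)
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return response.choices[0].message.content
```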
Licensing & Compatibility: Specific licensing information (e.g., MIT, Apache, GPL) and any associated restrictions for commercial use or integration with closed-source systems are not detailed in the provided text.
Limitations & Caveats: Setup and execution require obtaining and configuring API keys for external cloud-based AI services (Gemini, OpenAI), which may involve costs and usage limits. The evaluation harness is tailored to these specific APIs; extending it to other models or custom local deployments would require significant modifications. Some of the referenced Gemini models are designated as experimental. The benchmark focuses specifically on embodied reasoning tasks and may not cover all aspects of AI agent capabilities.
Last updated: 10 months ago · Status: Inactive