Benchmark for evaluating math reasoning in visual contexts
MathVista is a benchmark dataset and evaluation framework designed to systematically assess the mathematical reasoning capabilities of foundation models within visual contexts. It targets researchers and developers working on multimodal AI, providing a standardized way to measure progress in complex visual-mathematical problem-solving.
How It Works
MathVista comprises 6,141 examples drawn from 28 existing datasets and 3 newly created ones (IQTest, FunctionQA, PaperQA). Models must combine fine-grained visual understanding with compositional reasoning across mathematical tasks such as arithmetic, algebra, geometry, and logic, each presented with visual elements. Evaluation proceeds in three stages: response generation, answer extraction (using an LLM such as GPT-4), and score calculation.
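The three evaluation stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's evaluation scripts: `generate_response` is a hypothetical stub standing in for a model call, `extract_answer` uses a simple regular expression where the benchmark uses an LLM, and the record fields (`question`, `answer`) are made up for the example.

```python
import re

# Toy records; the field names here are assumptions for illustration.
EXAMPLES = [
    {"question": "What is the slope of the line in the figure?", "answer": "2"},
    {"question": "How many bars exceed 10?", "answer": "3"},
    {"question": "What is the area of the triangle?", "answer": "12"},
]


def generate_response(example):
    """Stage 1 (hypothetical stub): a real run would query a multimodal model."""
    canned = {
        "What is the slope of the line in the figure?": "Rise over run gives a slope of 2.",
        "How many bars exceed 10?": "I count four bars, so the answer is 4.",
        "What is the area of the triangle?": "Half of 6 times 4, so the answer is 12.",
    }
    return canned[example["question"]]


def extract_answer(response):
    """Stage 2 (simplified): take the last number in the response.

    MathVista uses an LLM for this step; the regex is only a stand-in.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None


def score(predictions, references):
    """Stage 3: accuracy as the fraction of exact string matches."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


responses = [generate_response(ex) for ex in EXAMPLES]
predictions = [extract_answer(r) for r in responses]
accuracy = score(predictions, [ex["answer"] for ex in EXAMPLES])
print(f"accuracy = {accuracy:.3f}")
```

Keeping the stages as separate functions means the extraction step can be swapped (regex, LLM, or manual review) without touching generation or scoring.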
Quick Start & Requirements
Load the dataset from the Hugging Face Hub:
from datasets import load_dataset
dataset = load_dataset("AI4Math/MathVista")
Required Python packages include openai, anthropic, and bardapi. API keys for these services are required for reproduction.
wget and unzip commands are provided to download images.
A testmini subset (1,000 examples) is available for development. Full dataset details and evaluation scripts are available on the project page.
Highlighted Details
Maintenance & Community
The project is associated with ICLR 2024 (Oral presentation). Updates and discussions can be found via Twitter and GitHub issues. Key contributors are from UCLA, the University of Washington, and Microsoft Research.
Licensing & Compatibility
The dataset's new contributions are released under CC BY-SA 4.0. The dataset may be used commercially as a test set, but using it for training is prohibited. Copyright of the images and original questions belongs to their respective authors.
Limitations & Caveats
Answer labels for the full test subset are not publicly released, so results on it require submission for evaluation. The evaluation process relies on an LLM (e.g., GPT-4) for answer extraction, which makes reported scores depend in part on that model's extraction accuracy.
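One way to gauge how sensitive scores are to the extraction step is to run two extractors over the same responses and measure how often they agree. The sketch below assumes two hypothetical rule-based extractors and made-up responses; in practice one of the two would be the LLM-based extractor.

```python
import re


def extract_last_number(response):
    """Hypothetical extractor A: last number mentioned in the response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None


def extract_after_marker(response):
    """Hypothetical extractor B: number following the phrase 'answer is'."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", response, re.IGNORECASE)
    return match.group(1) if match else None


def agreement_rate(responses, extractor_a, extractor_b):
    """Fraction of responses on which both extractors return the same value."""
    same = sum(extractor_a(r) == extractor_b(r) for r in responses)
    return same / len(responses)


# Made-up model responses for illustration only.
responses = [
    "The answer is 7.",
    "The answer is 3, since 9 / 3 = 3.",
    "Adding 2 and 5 gives 7.",
    "The answer is 4 out of 10 items.",
]
rate = agreement_rate(responses, extract_last_number, extract_after_marker)
print(f"agreement = {rate:.2f}")
```

A low agreement rate suggests that a meaningful share of the final score depends on which extractor is used, and that the disagreeing responses are worth inspecting by hand.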