MathVista by lupantech

Benchmark for evaluating math reasoning in visual contexts

Created 2 years ago

355 stars

Top 78.9% on SourcePulse

1 Expert Loves This Project

Junyang Lin

Core Maintainer at Alibaba Qwen

Project Summary

MathVista is a benchmark dataset and evaluation framework designed to systematically assess the mathematical reasoning capabilities of foundation models within visual contexts. It targets researchers and developers working on multimodal AI, providing a standardized way to measure progress in complex visual-mathematical problem-solving.

How It Works

MathVista comprises 6,141 examples drawn from 28 existing datasets and 3 newly created ones (IQTest, FunctionQA, PaperQA). Each example pairs a mathematical task (arithmetic, algebra, geometry, logic, and others) with a visual element, so models must combine fine-grained visual understanding with compositional reasoning. Evaluation runs in three stages: response generation, answer extraction (using an LLM such as GPT-4), and score calculation.
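The three stages described above can be sketched roughly as follows. This is a minimal offline illustration, not the project's evaluation script: the `query` and `answer` field names, the `toy_model` function, and the regex-based extractor are all assumptions made for the example. In particular, the benchmark uses an LLM for answer extraction; a regular expression stands in here only so the sketch runs without API access.

```python
import re
from typing import Callable, Iterable


def generate_response(example: dict, model: Callable[[str], str]) -> str:
    """Stage 1: ask the model under test for a free-form response."""
    # "query" is an assumed field name holding the full prompt text.
    return model(example["query"])


def extract_answer(response: str) -> str:
    """Stage 2 (stand-in): return the last number mentioned in the response.

    The benchmark performs this step with an LLM; this regex is a
    simplified substitute and will miss non-numeric answers.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""


def score(predictions: Iterable[str], answers: Iterable[str]) -> float:
    """Stage 3: exact-match accuracy over (prediction, answer) pairs."""
    pairs = list(zip(predictions, answers))
    if not pairs:
        return 0.0
    return sum(p.strip() == a.strip() for p, a in pairs) / len(pairs)


# Hypothetical toy examples; real items also carry an image.
examples = [
    {"query": "What is 2 + 3?", "answer": "5"},
    {"query": "How many sides does a triangle have?", "answer": "3"},
]


def toy_model(query: str) -> str:
    """Hypothetical model: right on the first question, wrong on the second."""
    if "2 + 3" in query:
        return "Adding the two numbers gives 5."
    return "I count 4 sides."


responses = [generate_response(ex, toy_model) for ex in examples]
predictions = [extract_answer(r) for r in responses]
accuracy = score(predictions, [ex["answer"] for ex in examples])
print(predictions, accuracy)
```

Keeping the stages as separate functions means the extraction step can be swapped for an LLM call without touching generation or scoring.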

Quick Start & Requirements

Dataset Loading: from datasets import load_dataset; dataset = load_dataset("AI4Math/MathVista")
Dependencies: Python and the Hugging Face Datasets library. Evaluating specific models optionally requires the openai, anthropic, or bardapi packages, and reproducing the reported results for those models requires API keys for the corresponding services.
Image Download: Optional wget and unzip commands are provided to download images.
Resources: The testmini subset (1,000 examples) is available for development. Full dataset details and evaluation scripts are available on the project page.
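For development work on the testmini subset, a loading helper might look like the sketch below. The `split="testmini"` argument follows the subset name mentioned above, and `question_type` is an assumed field name, so check both against the dataset card before relying on them. The download itself is left in a comment because it needs network access; the counting helper is demonstrated on hypothetical toy records instead.

```python
from collections import Counter


def load_testmini():
    """Load the testmini split (needs network access and the `datasets` package)."""
    # Imported lazily so the rest of this sketch runs without the dependency.
    from datasets import load_dataset

    return load_dataset("AI4Math/MathVista", split="testmini")


def count_by(examples, key):
    """Count how many examples share each value of the given field."""
    return Counter(ex[key] for ex in examples)


# Toy records standing in for real examples (hypothetical values).
toy_examples = [
    {"question_type": "multi_choice"},
    {"question_type": "free_form"},
    {"question_type": "multi_choice"},
]
toy_counts = count_by(toy_examples, "question_type")
print(toy_counts)

# With network access, the same helper could be applied to the real split:
#   testmini = load_testmini()
#   print(len(testmini), count_by(testmini, "question_type"))
```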

Highlighted Details

Benchmarks 60+ models, including leading LLMs and LMMs, with a continuously updated leaderboard.
Covers 7 mathematical reasoning skills (algebraic, arithmetic, geometry, logical, numeric, scientific, statistical) and various task types (FQA, GPS, MWP, TQA, VQA).
Features a three-stage evaluation pipeline designed to handle models that produce long, free-form responses.
Includes interactive dataset exploration tools and visualization pages.
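Because results are reported per reasoning skill, a per-skill accuracy breakdown is a natural companion to the overall score. The sketch below assumes each scored record carries a list of skill tags and a boolean correctness flag; the `skills` and `correct` keys are hypothetical names chosen for illustration, and the assumption that one example may carry several skills should be checked against the dataset's metadata.

```python
from collections import defaultdict


def accuracy_by_skill(records):
    """Return {skill: accuracy}, counting an example once per skill it is tagged with."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        for skill in rec["skills"]:
            totals[skill] += 1
            correct[skill] += int(rec["correct"])
    return {skill: correct[skill] / totals[skill] for skill in totals}


# Hypothetical scored records, not real benchmark results.
records = [
    {"skills": ["algebraic", "geometry"], "correct": True},
    {"skills": ["geometry"], "correct": False},
    {"skills": ["statistical"], "correct": True},
    {"skills": ["algebraic"], "correct": False},
]
result = accuracy_by_skill(records)
print(result)
```

Since an example tagged with two skills contributes to both, the per-skill counts can sum to more than the number of examples.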

Maintenance & Community

The accompanying paper was accepted at ICLR 2024 as an oral presentation. Updates and discussions can be found via Twitter and GitHub issues. Key contributors are affiliated with UCLA, the University of Washington, and Microsoft Research.

Licensing & Compatibility

The dataset's new contributions are licensed under CC BY-SA 4.0. Commercial use is permitted when the data serves as a test set, but using it for training is prohibited. Copyright of the images and original questions belongs to their respective authors.

Limitations & Caveats

Answer labels for the full test subset are not publicly released, so results on it must be submitted for evaluation. Answer extraction relies on an LLM (e.g., GPT-4), so reported scores depend in part on how accurately that model extracts answers.

Health Check

Last Commit

5 months ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days