MathVista by lupantech

Benchmark for evaluating math reasoning in visual contexts

created 1 year ago
325 stars

Top 85.0% on sourcepulse

View on GitHub
Project Summary

MathVista is a benchmark dataset and evaluation framework designed to systematically assess the mathematical reasoning capabilities of foundation models within visual contexts. It targets researchers and developers working on multimodal AI, providing a standardized way to measure progress in complex visual-mathematical problem-solving.

How It Works

MathVista comprises 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created ones (IQTest, FunctionQA, PaperQA). It requires models to perform fine-grained visual understanding and compositional reasoning across mathematical tasks, including arithmetic, algebra, geometry, and logic, all presented with visual elements. The evaluation process involves three stages: response generation, answer extraction (using an LLM such as GPT-4), and score calculation.
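The extraction and scoring stages can be sketched as below. This is an illustrative simplification: the official pipeline delegates answer extraction to GPT-4, while the regex fallback and the function names here (`extract_answer`, `score`) are hypothetical.

```python
import re

def extract_answer(response: str) -> str:
    """Stage 2 (simplified): pull a final answer out of a free-form response.
    MathVista's official pipeline uses an LLM (e.g., GPT-4) for this step;
    this regex fallback is only a sketch."""
    match = re.search(r"answer is\s*:?\s*(.+?)(?:\.|$)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()

def score(extracted: str, ground_truth: str) -> bool:
    """Stage 3 (simplified): normalized exact match against the label."""
    return extracted.strip().lower() == ground_truth.strip().lower()

# Stage 1 (response generation) would call the model under evaluation:
response = "The region is a square of side 6.48... Therefore, the answer is 42."
pred = extract_answer(response)   # stage 2
print(score(pred, "42"))          # stage 3
```

Splitting extraction from scoring matters because many models interleave the final answer with long chains of reasoning, so naive string matching against the raw response would under-count correct answers.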

Quick Start & Requirements

  • Dataset Loading: from datasets import load_dataset; dataset = load_dataset("AI4Math/MathVista")
  • Dependencies: Python, Hugging Face Datasets library. Optional dependencies for evaluating specific models include openai, anthropic, and bardapi. API keys for these services are required for reproduction.
  • Image Download: Optional wget and unzip commands are provided to download images.
  • Resources: The testmini subset (1,000 examples) is available for development. Full dataset details and evaluation scripts are available on the project page.
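A minimal loading sketch, assuming the `testmini` split and the `question`/`choices` field names from the dataset card (verify these against the version you download); the `format_prompt` helper is hypothetical:

```python
def format_prompt(example: dict) -> str:
    """Build a simple text prompt from a MathVista example.
    Field names ('question', 'choices') follow the dataset card;
    multiple-choice items are rendered as lettered options."""
    prompt = example["question"]
    if example.get("choices"):
        options = "\n".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(example["choices"])
        )
        prompt += "\nChoices:\n" + options
    return prompt

def load_testmini():
    """Download the 1,000-example development split (requires network)."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset("AI4Math/MathVista", split="testmini")

# Example usage (network access required):
# testmini = load_testmini()
# print(format_prompt(testmini[0]))
```

Each example also carries the associated image, so a multimodal model would receive the formatted prompt together with the image field rather than text alone.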

Highlighted Details

  • Benchmarks 60+ models, including leading LLMs and LMMs, with a continuously updated leaderboard.
  • Covers 7 mathematical reasoning skills (algebraic, arithmetic, geometry, logical, numeric, scientific, statistical) and various task types (FQA, GPS, MWP, TQA, VQA).
  • Features a three-stage evaluation pipeline that accommodates models producing long-form responses.
  • Includes interactive dataset exploration tools and visualization pages.

Maintenance & Community

The project is associated with ICLR 2024 (Oral presentation). Updates and discussions can be found via Twitter and GitHub issues. Key contributors are listed from UCLA, University of Washington, and Microsoft Research.

Licensing & Compatibility

The dataset's new contributions are licensed under CC BY-SA 4.0. Commercial use as a test set is permitted, but use of the data for model training is prohibited. Copyright of the images and original questions belongs to their respective authors.

Limitations & Caveats

The answer labels for the full test set are not publicly released; scoring requires submitting predictions for evaluation. The evaluation pipeline also relies on an LLM (e.g., GPT-4) for answer extraction, so reported scores depend in part on that model's extraction accuracy.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days
