Math-Verify by huggingface

Math evaluator for LLM outputs in mathematical tasks

created 6 months ago
861 stars

Top 42.5% on sourcepulse

Project Summary

Math-Verify is a mathematical expression evaluation system for assessing Large Language Model outputs on mathematical tasks. It targets researchers and developers who run LLMs on math-heavy benchmarks, and it aims for higher accuracy than existing evaluators by parsing diverse mathematical notations and applying flexible comparison logic.

How It Works

Math-Verify works in three steps: answer extraction, conversion of the extracted expression to a common representation (via SymPy), and comparison against the gold answer. Extraction is format-agnostic: prioritized regex patterns retrieve answers written as LaTeX, plain expressions, or strings. Extracted answers are then normalized to a common SymPy representation, which repairs malformed input and handles units, percentages, and set theory. The comparison step checks numerical (with tolerance), symbolic, relational, set, interval, and matrix equivalence.
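The three steps above can be illustrated with a toy pipeline. This is a minimal sketch using only the Python standard library, not Math-Verify's implementation: the patterns, the function names (extract, normalize, matches_gold, grade), and the use of Fraction in place of SymPy are all assumptions made for illustration.

```python
import math
import re
from fractions import Fraction
from typing import Optional

# Hypothetical patterns, tried in priority order:
# a \boxed{...} answer, then inline $...$ math, then a bare number.
EXTRACTION_PATTERNS = [
    re.compile(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}"),
    re.compile(r"\$([^$]+)\$"),
    re.compile(r"(-?\d+(?:\.\d+)?(?:/\d+)?%?)"),
]
FRAC = re.compile(r"\\frac\{(-?\d+)\}\{(-?\d+)\}")


def extract(text: str) -> Optional[str]:
    """Step 1: return the last match of the highest-priority pattern that matches."""
    for pattern in EXTRACTION_PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1].strip()
    return None


def normalize(raw: Optional[str]) -> Optional[Fraction]:
    """Step 2: convert the raw string to one common representation (a Fraction here)."""
    if raw is None:
        return None
    is_percent = raw.endswith("%")
    if is_percent:
        # Drop the percent sign, including a LaTeX-escaped "\%".
        raw = raw[:-1].rstrip("\\").strip()
    try:
        frac = FRAC.fullmatch(raw)
        value = Fraction(int(frac.group(1)), int(frac.group(2))) if frac else Fraction(raw)
    except (ValueError, ZeroDivisionError):
        return None  # not something this toy normalizer understands
    return value / 100 if is_percent else value


def matches_gold(gold: Optional[Fraction], pred: Optional[Fraction], rel_tol: float = 1e-6) -> bool:
    """Step 3: exact comparison first, then a numerical comparison with tolerance."""
    if gold is None or pred is None:
        return False
    return gold == pred or math.isclose(float(gold), float(pred), rel_tol=rel_tol)


def grade(gold_text: str, prediction_text: str) -> bool:
    """Run extraction, normalization, and comparison end to end."""
    return matches_gold(normalize(extract(gold_text)), normalize(extract(prediction_text)))
```

For instance, grade("$\\frac{1}{2}$", "The answer is \\boxed{0.5}") treats the two answers as equal because both normalize to the same Fraction. The real library covers far more cases (symbolic expressions, sets, intervals, matrices) than this sketch attempts.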

Quick Start & Requirements

  • Install with a specific ANTLR4 runtime: pip install 'math-verify[antlr4_13_2]' (the quotes keep shells such as zsh from interpreting the brackets).
  • Recommended: pip install 'math-verify[inference]' for end-to-end evaluation.
  • Prerequisites: Python, ANTLR4 runtime (specified during install).
  • Documentation: https://github.com/huggingface/Math-Verify
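Once the package is installed, a basic check might look like the sketch below. It relies on the parse and verify functions exported by math_verify, as shown in the project's README; the wrapper name is_correct is a hypothetical helper introduced here for illustration.

```python
def is_correct(gold_text: str, prediction_text: str) -> bool:
    """Return True when the parsed prediction matches the parsed gold answer."""
    # Imported inside the function so this helper can be defined even where
    # math-verify is not installed; calling it still requires the package.
    from math_verify import parse, verify

    gold = parse(gold_text)
    prediction = parse(prediction_text)
    # Gold goes first: verify is not symmetric in its arguments.
    return bool(verify(gold, prediction))
```

A call such as is_correct("${1,3} \\cup {2,4}$", "${1,2,3,4}$") mirrors the set-union comparison used as the example in the README. Note that the LaTeX is wrapped in $ ... $ so that the extractor can find it.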

Highlighted Details

  • Reports the highest accuracy on the MATH dataset when compared with the Harness and Qwen evaluators.
  • Supports comprehensive extraction targets: LaTeX, plain expressions, and literal strings.
  • Advanced parsing includes set theory, Unicode substitution, unit handling, and matrix operations.
  • Intelligent comparison handles numerical precision, symbolic equivalence, and relational expressions.

Maintenance & Community

  • Developed by Hugging Face.
  • No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • License: Not explicitly stated in the README. Potential ambiguity if multiple licenses are implied by dependencies.
  • Compatibility: Designed for LLM output evaluation; commercial use depends on the unstated license.

Limitations & Caveats

The verify function is intentionally asymmetric: comparing an inequality with an interval, or a number with a solution chain, can give a different result depending on which argument is the gold answer. This asymmetry is meant to prevent reward hacking, but it may require specific configuration (allow_set_relation_comp). LaTeX extraction only finds expressions placed inside a math environment (e.g., \[ ... \], $ ... $).

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 203 stars in the last 90 days
