Math-Verify by huggingface

Math evaluator for LLM outputs in mathematical tasks

Created 11 months ago
1,067 stars

Top 35.5% on SourcePulse

Project Summary

Math-Verify is a robust mathematical expression evaluation system designed to accurately assess Large Language Model outputs on mathematical tasks. It targets researchers and developers working with LLMs on math-heavy benchmarks, offering superior accuracy over existing evaluators by handling diverse mathematical notations and flexible comparison logic.

How It Works

Math-Verify employs a three-step process: Answer Extraction, Expression Common Representation Conversion (via SymPy), and Gold Comparison. It uses format-agnostic extraction with prioritized regex patterns to retrieve answers, supporting LaTeX, plain expressions, and strings. Extracted answers are normalized to a common SymPy representation, fixing malformations and handling units, percentages, and set theory. Comparisons include numerical (with tolerance), symbolic, relational, set, interval, and matrix equivalences.
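The three steps above can be sketched with stdlib-only toy stand-ins. This is an illustrative sketch, not Math-Verify's actual code or API: the regex patterns and the `extract_answer`, `normalize`, and `numerically_equal` helpers are simplified assumptions (the real library normalizes through SymPy and supports far more notations).

```python
import math
import re
from fractions import Fraction

def extract_answer(text: str) -> str:
    """Step 1: format-agnostic extraction with prioritized patterns.
    Try \\boxed{...} first, then fall back to the last bare number."""
    boxed = re.findall(r"\\boxed\{([^{}]+)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?%?", text)
    return numbers[-1] if numbers else ""

def normalize(answer: str) -> Fraction:
    """Step 2: convert to a common representation; handles percentages
    and simple fractions (Math-Verify uses SymPy for the general case)."""
    answer = answer.strip()
    if answer.endswith("%"):
        return Fraction(answer[:-1]) / 100
    return Fraction(answer)  # accepts both "1/2" and "0.5"

def numerically_equal(gold: str, pred: str, rel_tol: float = 1e-6) -> bool:
    """Step 3: numerical comparison with tolerance."""
    return math.isclose(float(normalize(gold)),
                        float(normalize(pred)), rel_tol=rel_tol)

print(numerically_equal(extract_answer(r"So the answer is \boxed{50%}."), "1/2"))  # True
```

The prioritized-pattern idea is the key design choice: more specific formats (boxed LaTeX) are tried before generic fallbacks (any trailing number), so well-formatted answers are never shadowed by stray digits in the reasoning.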

Quick Start & Requirements

  • Install with a specific ANTLR4 runtime: pip install 'math-verify[antlr4_13_2]' (quotes keep the brackets from being expanded by shells such as zsh).
  • Recommended: pip install 'math-verify[inference]' for end-to-end evaluation.
  • Prerequisites: Python, ANTLR4 runtime (specified during install).
  • Documentation: https://github.com/huggingface/Math-Verify

Highlighted Details

  • Achieves highest accuracy on the MATH dataset compared to Harness and Qwen evaluators.
  • Supports comprehensive extraction targets: LaTeX, plain expressions, and literal strings.
  • Advanced parsing includes set theory, Unicode substitution, unit handling, and matrix operations.
  • Intelligent comparison handles numerical precision, symbolic equivalence, and relational expressions.

Maintenance & Community

  • Developed by Hugging Face.
  • No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • License: Not stated in the README; check the repository's LICENSE file, and note that dependencies (e.g., the ANTLR4 runtime) may carry their own licenses.
  • Compatibility: Designed for LLM output evaluation; commercial use depends on license terms the README does not state.

Limitations & Caveats

The verify function is intentionally asymmetric in certain comparisons (inequalities vs. intervals, single numbers vs. solution chains) to prevent reward hacking; relaxing the set-vs-relation case requires explicit configuration (allow_set_relation_comp). Additionally, LaTeX extraction only recognizes expressions inside math environments (e.g., \[ ... \] or $ ... $).

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 8
  • Star History: 38 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Junyang Lin (core maintainer at Alibaba Qwen).

InternLM-Math by InternLM

Math LLM for bilingual reasoning tasks
532 stars · Created 2 years ago · Updated 1 year ago