Math-Verify by huggingface

Math evaluator for LLM outputs in mathematical tasks

Created 11 months ago

1,067 stars

Top 35.5% on SourcePulse

5 Experts Love This Project

Yineng Zhang

Inference Lead at SGLang; Research Scientist at Together AI

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Thomas Wolf

Cofounder of Hugging Face

Yaowei Zheng

Author of LLaMA-Factory

and 1 more!

Project Summary

Math-Verify is a mathematical expression evaluation system for assessing Large Language Model outputs on mathematical tasks. It targets researchers and developers who run LLMs on math-heavy benchmarks, and it aims for higher accuracy than existing evaluators by handling diverse mathematical notations and applying flexible comparison logic.

How It Works

Math-Verify employs a three-step process: Answer Extraction, Expression Common Representation Conversion (via SymPy), and Gold Comparison. It uses format-agnostic extraction with prioritized regex patterns to retrieve answers, supporting LaTeX, plain expressions, and strings. Extracted answers are normalized to a common SymPy representation, fixing malformations and handling units, percentages, and set theory. Comparisons include numerical (with tolerance), symbolic, relational, set, interval, and matrix equivalences.
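
As a rough illustration of the three steps above, the following toy sketch chains prioritized regex extraction, normalization to a common representation, and a tolerance-based comparison. It is not Math-Verify's implementation: it uses only the Python standard library, with fractions.Fraction standing in for SymPy, and every name in it (EXTRACTION_PATTERNS, extract_answer, normalize, matches_gold) is hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

# Hypothetical patterns, tried in priority order; the first pattern that matches wins.
EXTRACTION_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]+)\}"),               # 1. \boxed{...} without nested braces
    re.compile(r"\$([^$]+)\$"),                       # 2. inline math: $...$
    re.compile(r"(-?\d+(?:\.\d+)?(?:/\d+)?%?)\s*$"),  # 3. plain number at the end of the text
]
FRAC = re.compile(r"\\frac\{(-?\d+)\}\{(\d+)\}")


def extract_answer(text: str) -> Optional[str]:
    """Step 1: return the last match of the highest-priority pattern that matches."""
    for pattern in EXTRACTION_PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1].strip()
    return None


def normalize(raw: str) -> Fraction:
    """Step 2: convert a simple fraction, percentage, or number to a Fraction."""
    raw = raw.replace(" ", "")
    frac = FRAC.fullmatch(raw)
    if frac:
        return Fraction(int(frac.group(1)), int(frac.group(2)))
    if raw.endswith("%"):
        # Handles both "50%" and the LaTeX form "50\%".
        return Fraction(raw.rstrip("%\\")) / 100
    return Fraction(raw)


def matches_gold(gold: str, prediction: str,
                 tolerance: Fraction = Fraction(1, 10**6)) -> bool:
    """Step 3: compare the extracted values numerically, within a tolerance."""
    gold_raw = extract_answer(gold)
    pred_raw = extract_answer(prediction)
    if gold_raw is None or pred_raw is None:
        return False
    try:
        return abs(normalize(gold_raw) - normalize(pred_raw)) <= tolerance
    except (ValueError, ZeroDivisionError):
        # Anything this toy normalizer cannot interpret counts as a mismatch.
        return False


print(matches_gold("$\\frac{1}{2}$", "So the answer is \\boxed{0.5}."))
print(matches_gold("$\\frac{1}{2}$", "The probability is 50%"))
```

The sketch only covers numeric answers. The symbolic, relational, set, interval, and matrix comparisons listed above need a real symbolic backend, which is the role SymPy plays in Math-Verify.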

Quick Start & Requirements

Install with a specific ANTLR4 runtime: pip install 'math-verify[antlr4_13_2]' (the quotes keep shells such as zsh from expanding the brackets)
Recommended: pip install 'math-verify[inference]' for end-to-end evaluation.
Prerequisites: Python, ANTLR4 runtime (specified during install).
Documentation: https://github.com/huggingface/Math-Verify
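
Once the package is installed, a first check might look like the sketch below. It relies on the parse and verify functions exported by math_verify; the is_equivalent wrapper is a hypothetical helper, and the import guard is there only so the snippet fails with a readable message when the package is missing.

```python
try:
    from math_verify import parse, verify
except ImportError:  # math-verify is not installed in this environment
    parse = verify = None


def is_equivalent(gold_text: str, answer_text: str) -> bool:
    """Parse both strings and check the answer against the gold reference."""
    if parse is None:
        raise RuntimeError("Install first: pip install 'math-verify[antlr4_13_2]'")
    gold = parse(gold_text)
    answer = parse(answer_text)
    # Argument order matters: the gold reference goes first, the model answer second.
    return bool(verify(gold, answer))


if parse is not None:
    # A union of two sets compared with the same set written out explicitly.
    print(is_equivalent("${1,3} \\cup {2,4}$", "${1,2,3,4}$"))
```

Both inputs are wrapped in $ ... $ because LaTeX extraction expects a math environment around the expression.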

Highlighted Details

Reports the highest accuracy on the MATH dataset when compared with the Harness and Qwen evaluators.
Supports comprehensive extraction targets: LaTeX, plain expressions, and literal strings.
Advanced parsing includes set theory, Unicode substitution, unit handling, and matrix operations.
Intelligent comparison handles numerical precision, symbolic equivalence, and relational expressions.

Maintenance & Community

Developed by Hugging Face.
No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

License: Not explicitly stated in the README, so the applicable terms (including any implied by dependencies) remain ambiguous.
Compatibility: Designed for LLM output evaluation; commercial use depends on the unstated license.

Limitations & Caveats

The verify function is asymmetric in two cases: comparisons between inequalities and intervals, and comparisons between numbers and solution chains. The asymmetry is intentional, to prevent reward hacking, and the allow_set_relation_comp option is available where different behavior is needed. LaTeX extraction requires expressions to be wrapped in a math environment (e.g., \[ ... \] or $ ... $).
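
One way to observe the asymmetry is to run the same pair through verify in both argument orders. The sketch below assumes the parse and verify functions exported by math_verify; compare_both_ways is a hypothetical helper, and forwarding allow_set_relation_comp as a keyword argument of verify is an assumption based on the option name mentioned above.

```python
try:
    from math_verify import parse, verify
except ImportError:  # math-verify is not installed in this environment
    parse = verify = None


def compare_both_ways(first_text: str, second_text: str, **verify_kwargs):
    """Return (result with first as gold, result with second as gold)."""
    if parse is None:
        raise RuntimeError("math-verify is not installed")
    first = parse(first_text)
    second = parse(second_text)
    # Extra keyword arguments are forwarded to verify unchanged (assumed to be accepted).
    forward = bool(verify(first, second, **verify_kwargs))
    backward = bool(verify(second, first, **verify_kwargs))
    return forward, backward


if parse is not None:
    # An inequality and the interval describing the same set of values.
    print(compare_both_ways("$1 < x < 2$", "$(1, 2)$"))
```

If the two values in the returned pair differ, the comparison depends on which side is treated as the gold answer. Calling compare_both_ways with allow_set_relation_comp=True, provided verify accepts that keyword, shows how the option changes the result.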

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

38 stars in the last 30 days