Math evaluator for LLM outputs in mathematical tasks
Math-Verify is a robust mathematical expression evaluation system designed to accurately assess Large Language Model outputs on mathematical tasks. It targets researchers and developers working with LLMs on math-heavy benchmarks, offering superior accuracy over existing evaluators by handling diverse mathematical notations and flexible comparison logic.
How It Works
Math-Verify employs a three-step process: Answer Extraction, Expression Common Representation Conversion (via SymPy), and Gold Comparison. It uses format-agnostic extraction with prioritized regex patterns to retrieve answers, supporting LaTeX, plain expressions, and strings. Extracted answers are normalized to a common SymPy representation, fixing malformations and handling units, percentages, and set theory. Comparisons include numerical (with tolerance), symbolic, relational, set, interval, and matrix equivalences.
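The following is a minimal sketch of that extract-normalize-compare pipeline. It assumes the package's top-level parse and verify helpers and the LatexExtractionConfig / ExprExtractionConfig classes; the extraction_config keyword name is an assumption here, so treat the exact signatures as illustrative rather than canonical.

from math_verify import ExprExtractionConfig, LatexExtractionConfig, parse, verify

# Extraction is format-agnostic: each configured extractor is tried in
# priority order against the raw model output (keyword name assumed).
gold = parse("$\\frac{1}{2}$", extraction_config=[LatexExtractionConfig()])
answer = parse(
    "The final answer is 1/2",
    extraction_config=[LatexExtractionConfig(), ExprExtractionConfig()],
)

# Both sides are normalized to SymPy objects, so LaTeX and plain notation
# compare equal under numeric/symbolic equivalence.
print(verify(gold, answer))  # expected: True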
Quick Start & Requirements
pip install 'math-verify[antlr4_13_2]'   (core install; the extra selects the ANTLR runtime version used for LaTeX parsing)
pip install 'math-verify[inference]'   (optional extra for end-to-end evaluation)
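As a quick smoke test after installation, a minimal sketch using the top-level parse and verify entry points:

from math_verify import parse, verify

# Parse the gold answer and the model prediction.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

# Gold goes first; verify is not symmetric.
print(verify(gold, answer))  # expected: True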
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The verify function is asymmetric in comparisons between inequalities and intervals, and between numbers and solution chains; this is intentional to prevent reward hacking, but it may require specific configuration (allow_set_relation_comp). LaTeX extraction requires expressions to appear within supported math delimiters (e.g., \[ ... \], $ ... $).
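A hedged illustration of the asymmetry caveat follows. Whether allow_set_relation_comp is passed directly to verify (as shown) or set elsewhere in the configuration is an assumption; check the API before relying on it.

from math_verify import parse, verify

gold = parse("$x < 3$")             # gold answer given as an inequality
answer = parse("$(-\\infty, 3)$")   # prediction given as an interval

# Cross-form comparison (inequality vs. interval) is restricted by default
# to prevent reward hacking; the flag below (placement assumed) relaxes it.
print(verify(gold, answer, allow_set_relation_comp=True))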