math-evaluation-harness by ZubinGou

Benchmarking toolkit for LLM mathematical reasoning

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

This toolkit provides a unified, precise, and extensible framework for benchmarking Large Language Models (LLMs) on a wide range of mathematical reasoning tasks. It aims to harmonize evaluation methods across research projects so that comparisons remain consistent and reliable, making it useful for researchers and developers working with LLMs in mathematical domains.

How It Works

The harness supports various prompting paradigms, including Direct, Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (e.g., ToRA). It works with Hugging Face models and supports vLLM for inference, and it integrates a diverse array of mathematical datasets such as GSM8K, MATH, and SVAMP. This flexibility allows comprehensive evaluation of LLM capabilities across different reasoning strategies and data types.
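As a rough sketch of how the same model might be run under different paradigms: the prompt-type names below (cot, pal, tora) and the model path are placeholders and an assumption on my part; the exact values accepted by the harness should be confirmed in scripts/run_math_eval.sh.

    # Hypothetical prompt-type names; check scripts/run_math_eval.sh for the real ones
    MODEL=meta-llama/Llama-2-7b-hf        # placeholder Hugging Face model path

    bash scripts/run_eval.sh cot  $MODEL  # Chain-of-Thought prompting
    bash scripts/run_eval.sh pal  $MODEL  # Program-of-Thought / PAL prompting
    bash scripts/run_eval.sh tora $MODEL  # Tool-Integrated Reasoning (ToRA-style)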

Quick Start & Requirements

  • Environment Setup: Set up a Conda environment (conda create -n math_eval python=3.10, then conda activate math_eval) or use the provided vLLM Docker image.
  • Installation: Clone the repository (git clone https://github.com/ZubinGou/math-evaluation-harness.git), navigate into the directory, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 is recommended. A GPU is effectively required for running the LLMs being evaluated.
  • Usage: Configure model and data settings in scripts/run_math_eval.sh and set the PROMPT_TYPE variable. Execute the evaluation with bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH. A consolidated sketch of these steps appears after this list.
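Taken together, a typical end-to-end run looks roughly like the following; the PROMPT_TYPE value and model path are placeholders to adapt to your own setup.

    # Environment setup (Conda route)
    conda create -n math_eval python=3.10
    conda activate math_eval

    # Get the code and install dependencies
    git clone https://github.com/ZubinGou/math-evaluation-harness.git
    cd math-evaluation-harness
    pip install -r requirements.txt

    # Edit model/data settings in scripts/run_math_eval.sh, then run:
    PROMPT_TYPE=cot                                 # placeholder; use a prompt type the harness defines
    MODEL_NAME_OR_PATH=mistralai/Mistral-7B-v0.1    # placeholder Hugging Face model path
    bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH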

Highlighted Details

  • Supports a broad spectrum of mathematical datasets, including minerva_math, math, gsm8k, svamp, asdiv, mawps, tabmwp, finqa, theorem_qa, bbh, mmlu_stem, sat_math, mathqa, and hungarian_exam.
  • Compatible with various prompting strategies like Direct, CoT, PoT/PAL, and Tool-Integrated Reasoning.
  • Benchmarks provided for numerous base and fine-tuned models, including LLaMA-2, Mistral, Minerva, and DeepSeekMath, across different datasets and prompting methods.
  • Documents result variances above 5% between different math evaluation frameworks, underscoring the need for harmonized evaluation.

Maintenance & Community

The project is under active development, welcoming contributions via bug reports, feature requests, and pull requests. It references other key projects like ToRA, prm800k, lm-evaluation-harness, and DeepSeek-Math.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The README notes that the original MATH test set has been included in public training sets, and therefore suggests evaluating MATH performance on the OpenAI test subset instead. The project is under active development, so ongoing changes and occasional instability should be expected.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 3 more.

Math-Verify by huggingface · 0.8% · 933 stars
Math evaluator for LLM outputs in mathematical tasks
Created 8 months ago · Updated 2 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA · 0.8% · 1k stars
Evaluation suite for long-context language models (research paper)
Created 1 year ago · Updated 1 month ago