math-evaluation-harness by ZubinGou

Benchmarking toolkit for LLM mathematical reasoning

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

This toolkit provides a unified, precise, and extensible framework for benchmarking Large Language Models (LLMs) on a wide range of mathematical reasoning tasks. It aims to harmonize evaluation methods across research projects so that comparisons remain consistent and reliable, making it useful for researchers and developers working with LLMs in mathematical domains.

How It Works

The harness supports various prompting paradigms, including Direct, Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (e.g., ToRA). It works with Hugging Face models and supports vLLM for inference, and it integrates a diverse array of mathematical datasets such as GSM8K, MATH, and SVAMP. This flexibility allows comprehensive evaluation of LLM capabilities across different reasoning strategies and data types.
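As a rough sketch of how the same model might be run under different paradigms: the prompt-type names below (cot, pal, tora) and the model path are placeholders and an assumption on my part; the exact values accepted by the harness should be confirmed in scripts/run_math_eval.sh.

    # Hypothetical prompt-type names; check scripts/run_math_eval.sh for the real ones
    MODEL=meta-llama/Llama-2-7b-hf        # placeholder Hugging Face model path

    bash scripts/run_eval.sh cot  $MODEL  # Chain-of-Thought prompting
    bash scripts/run_eval.sh pal  $MODEL  # Program-of-Thought / PAL prompting
    bash scripts/run_eval.sh tora $MODEL  # Tool-Integrated Reasoning (ToRA-style)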

Quick Start & Requirements

  • Environment Setup: Set up a Conda environment (conda create -n math_eval python=3.10, then conda activate math_eval) or use the provided vLLM Docker image.
  • Installation: Clone the repository (git clone https://github.com/ZubinGou/math-evaluation-harness.git), navigate into the directory, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 is recommended. A GPU is effectively required for running the LLMs being evaluated.
  • Usage: Configure model and data settings in scripts/run_math_eval.sh and set the PROMPT_TYPE variable. Execute the evaluation with bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH. A consolidated sketch of these steps appears after this list.
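Taken together, a typical end-to-end run looks roughly like the following; the PROMPT_TYPE value and model path are placeholders to adapt to your own setup.

    # Environment setup (Conda route)
    conda create -n math_eval python=3.10
    conda activate math_eval

    # Get the code and install dependencies
    git clone https://github.com/ZubinGou/math-evaluation-harness.git
    cd math-evaluation-harness
    pip install -r requirements.txt

    # Edit model/data settings in scripts/run_math_eval.sh, then run:
    PROMPT_TYPE=cot                                 # placeholder; use a prompt type the harness defines
    MODEL_NAME_OR_PATH=mistralai/Mistral-7B-v0.1    # placeholder Hugging Face model path
    bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH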

Highlighted Details

  • Supports a broad spectrum of mathematical datasets, including minerva_math, math, gsm8k, svamp, asdiv, mawps, tabmwp, finqa, theorem_qa, bbh, mmlu_stem, sat_math, mathqa, and hungarian_exam.
  • Compatible with various prompting strategies like Direct, CoT, PoT/PAL, and Tool-Integrated Reasoning.
  • Benchmarks provided for numerous base and fine-tuned models, including LLaMA-2, Mistral, Minerva, and DeepSeekMath, across different datasets and prompting methods.
  • Documents result variances above 5% between different math evaluation frameworks, underscoring the need for harmonized evaluation.

Maintenance & Community

The project is under active development, welcoming contributions via bug reports, feature requests, and pull requests. It references other key projects like ToRA, prm800k, lm-evaluation-harness, and DeepSeek-Math.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The README notes that the original MATH test set has been included in public training sets, and therefore suggests evaluating MATH performance on the OpenAI test subset instead. The project is under active development, so ongoing changes and occasional instability should be expected.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 3 more.

Math-Verify by huggingface · 0.8% · 933 stars
Math evaluator for LLM outputs in mathematical tasks
Created 8 months ago · Updated 2 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA · 0.8% · 1k stars
Evaluation suite for long-context language models (research paper)
Created 1 year ago · Updated 1 month ago