Benchmarking toolkit for LLM mathematical reasoning
This toolkit provides a unified, precise, and extensible framework for benchmarking Large Language Models (LLMs) on a wide range of mathematical reasoning tasks. By harmonizing evaluation methods across research projects, it enables consistent and reliable comparisons, making it useful for researchers and developers working with LLMs in mathematical domains.
How It Works
The harness supports several prompting paradigms, including Direct, Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (e.g., ToRA). It works with models loaded from Hugging Face or served with vLLM, and it integrates a diverse array of mathematical datasets such as GSM8K, MATH, and SVAMP. This flexibility allows comprehensive evaluation of LLM capabilities across different reasoning strategies and data types.
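To make the distinction between paradigms concrete, the sketch below shows hypothetical prompt templates in the style of Direct, CoT, and PoT prompting for a GSM8K-style question. These templates are illustrative only and are not the harness's actual prompt files.

```python
# Illustrative only: hypothetical templates for the paradigms named above,
# not the prompts shipped with the harness.
QUESTION = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?"
)

PROMPTS = {
    # Direct: ask for the final answer with no intermediate reasoning.
    "direct": f"Question: {QUESTION}\nAnswer:",
    # Chain-of-Thought (CoT): elicit step-by-step natural-language reasoning.
    "cot": f"Question: {QUESTION}\nLet's think step by step.",
    # Program-of-Thought (PoT/PAL): ask for code whose result is the answer.
    "pot": (
        f"Question: {QUESTION}\n"
        "Write a Python program that computes the answer and stores it in `answer`."
    ),
}

for name, prompt in PROMPTS.items():
    print(f"--- {name} ---\n{prompt}\n")
```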
Quick Start & Requirements
Create a conda environment (conda create -n math_eval python=3.10, then conda activate math_eval) or use a provided Docker image (vLLM). Clone the repository (git clone https://github.com/ZubinGou/math-evaluation-harness.git), navigate into the directory, and install requirements (pip install -r requirements.txt). Open scripts/run_math_eval.sh and set the PROMPT_TYPE variable, then execute the evaluation using bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH.
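For scripted use, the evaluation can also be launched from Python. The following is a minimal sketch that simply wraps the shell invocation above; the PROMPT_TYPE value and model path are placeholders.

```python
# A minimal sketch: wrap the evaluation script shown in the Quick Start.
# PROMPT_TYPE and MODEL_NAME_OR_PATH below are placeholder values.
import subprocess

PROMPT_TYPE = "cot"                         # assumed; see scripts/run_math_eval.sh for valid options
MODEL_NAME_OR_PATH = "path/to/your/model"   # Hugging Face ID or local checkpoint

subprocess.run(
    ["bash", "scripts/run_eval.sh", PROMPT_TYPE, MODEL_NAME_OR_PATH],
    check=True,  # raise if the evaluation script exits with an error
)
```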
Highlighted Details
Maintenance & Community
The project is under active development, welcoming contributions via bug reports, feature requests, and pull requests. It references other key projects like ToRA, prm800k, lm-evaluation-harness, and DeepSeek-Math.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the license.
Limitations & Caveats
The README notes that the original MATH test set has appeared in public training sets and therefore suggests using the OpenAI test subset when evaluating MATH performance. The project is under active development, so interfaces and results may still change.