Dataset of LLM solutions to math problems with step-level correctness labels
PRM800K provides 800,000 step-level correctness labels for LLM-generated solutions to MATH dataset problems. This dataset is designed for researchers and engineers developing and evaluating process-based supervision methods for improving LLM reasoning capabilities, particularly in complex mathematical problem-solving.
How It Works
The dataset consists of detailed JSON annotations for each step within LLM-generated solutions. Each step is rated by human labelers as -1 (incorrect), 0 (neutral/no progress), or +1 (correct). Each record includes metadata such as labeler UUIDs, timestamps, generation phase, and quality-control flags, along with the original MATH problem, its ground-truth solution, and the model-generated solution broken down step by step.
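As a rough sketch of working with these annotations, the snippet below tallies step ratings across one data file. The file path and the nested field names (`label` → `steps` → `completions` → `rating`) are assumptions based on the description above, not guaranteed by this summary; check the repository's data files for the exact schema.

```python
import json
from collections import Counter

# Tally step-level ratings (-1, 0, +1) in one newline-delimited JSON file.
# Path and field names are assumptions; verify against the actual data.
ratings = Counter()
with open("prm800k/data/phase2_train.jsonl") as f:  # illustrative path
    for line in f:
        record = json.loads(line)
        for step in record["label"]["steps"]:
            # "completions" may be null when a human-written step was chosen.
            for completion in step.get("completions") or []:
                ratings[completion["rating"]] += 1

print(dict(ratings))
```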
Quick Start & Requirements
Clone the repository with Git LFS: `git lfs install && git clone <repo_url>`. The `data/` folder contains newline-delimited JSON files. Run `python eval/eval.py --method prm` for PRM evaluation or `python eval/eval.py --method orm` for ORM evaluation.
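A minimal end-to-end session, assembled from the commands above (the repository URL is left as a placeholder):

```sh
git lfs install              # the data files are stored with Git LFS
git clone <repo_url> prm800k
cd prm800k

# Process-based (step-level) evaluation:
python eval/eval.py --method prm

# Outcome-based (final-answer) evaluation:
python eval/eval.py --method orm
```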
Highlighted Details
Maintenance & Community
The project is associated with OpenAI and the paper "Let's Verify Step by Step." Further details and community engagement can be found via the linked blog post and paper.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The README notes that the answer grading logic is conservative and may occasionally reject correct answers or admit incorrect ones. The dataset uses a non-standard MATH train/test split, which may affect direct comparability with other benchmarks using standard splits.