prm800k by openai

Dataset of LLM solutions to math problems with step-level correctness labels

created 2 years ago
2,032 stars

Top 22.3% on sourcepulse

View on GitHub
Project Summary

PRM800K provides 800,000 step-level correctness labels for LLM-generated solutions to MATH dataset problems. This dataset is designed for researchers and engineers developing and evaluating process-based supervision methods for improving LLM reasoning capabilities, particularly in complex mathematical problem-solving.

How It Works

The dataset consists of detailed JSON annotations for each step within LLM-generated solutions. Each step is rated by human labelers as -1 (incorrect), 0 (neutral/no progress), or +1 (correct). The data includes metadata such as labeler UUIDs, timestamps, generation phase, and quality control flags. It also contains the original MATH problem, ground truth solution, and the model-generated solution with its step-by-step breakdown.
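
As a concrete illustration, the sketch below reads one annotation record in Python. The field names used (question.problem, question.ground_truth_answer, label.steps, completions[].rating) are inferred from the description above rather than a documented schema, so verify them against a sample line from the data/ folder before relying on them.

    import json

    # Minimal sketch of reading one annotation record. Field names are
    # assumptions based on the description above -- check a real line
    # from the data/ folder before relying on them.
    def summarize_record(line):
        record = json.loads(line)
        question = record.get("question") or {}
        steps = (record.get("label") or {}).get("steps") or []
        ratings = [
            completion.get("rating")          # expected to be -1, 0, or +1
            for step in steps
            for completion in (step.get("completions") or [])
        ]
        return {
            "problem": question.get("problem"),
            "ground_truth_answer": question.get("ground_truth_answer"),
            "num_steps": len(steps),
            "step_ratings": ratings,
        }

    with open("data/phase2_train.jsonl") as f:  # file name is illustrative
        print(summarize_record(next(f)))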

Quick Start & Requirements

  • Install: Clone the repository using Git LFS (git lfs install && git clone <repo_url>).
  • Dependencies: Python, Git LFS.
  • Data: The data/ folder contains newline-delimited JSON data (a streaming sketch appears after this list).
  • Evaluation: Run python eval/eval.py --method prm for PRM evaluation or python eval/eval.py --method orm for ORM evaluation.
  • Resources: Requires sufficient disk space for Git LFS data.
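
As referenced in the Data bullet above, the sketch below streams the newline-delimited JSON files and tallies step ratings without loading whole files into memory. The directory layout, file glob, and field names are assumptions based on this summary, not a documented interface.

    import json
    from collections import Counter
    from pathlib import Path

    # Sketch: stream every .jsonl file under data/ and tally the
    # -1/0/+1 step ratings. Paths and field names are assumptions.
    def rating_counts(data_dir="data"):
        counts = Counter()
        for path in sorted(Path(data_dir).glob("**/*.jsonl")):
            with path.open() as f:
                for line in f:
                    record = json.loads(line)
                    for step in (record.get("label") or {}).get("steps") or []:
                        for completion in step.get("completions") or []:
                            counts[completion.get("rating")] += 1
        return counts

    print(rating_counts())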

Highlighted Details

  • Contains 800,000 step-level correctness labels.
  • Includes original MATH problems, ground truth, and model-generated solutions.
  • Provides human labeler instructions for data collection phases.
  • Offers Python scripts for answer grading using SymPy for expression equality (a minimal sketch follows this list).
  • Uses a custom MATH train/test split for evaluation.
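
The SymPy-based expression-equality check mentioned above can be approximated as follows. grade_answer here is an illustrative sketch, not the repository's actual grading function; the repo's own scripts do more normalization and, per the caveat below, are deliberately conservative.

    import sympy

    # Hedged sketch of SymPy-based answer grading: treat two answers as
    # equal if their symbolic difference simplifies to zero, falling back
    # to string comparison when parsing fails.
    def grade_answer(given, ground_truth):
        try:
            diff = sympy.simplify(sympy.sympify(given) - sympy.sympify(ground_truth))
            return bool(diff == 0)
        except (sympy.SympifyError, TypeError):
            return given.strip() == ground_truth.strip()

    print(grade_answer("2*(x + 1)", "2*x + 2"))  # True
    print(grade_answer("1/2", "0.5"))            # True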

Maintenance & Community

The project is associated with OpenAI and the paper "Let's Verify Step by Step." Further details and community engagement can be found via the linked blog post and paper.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes that the answer grading logic is conservative and may occasionally reject correct answers or admit incorrect ones. The dataset uses a non-standard MATH train/test split, which may affect direct comparability with other benchmarks using standard splits.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 57 stars in the last 90 days
