prm800k by openai

Dataset of LLM solutions to math problems with step-level correctness labels

Created 2 years ago
2,047 stars

Top 21.8% on SourcePulse

Project Summary

PRM800K provides 800,000 step-level correctness labels for LLM-generated solutions to MATH dataset problems. This dataset is designed for researchers and engineers developing and evaluating process-based supervision methods for improving LLM reasoning capabilities, particularly in complex mathematical problem-solving.

How It Works

The dataset consists of detailed JSON annotations for each step within LLM-generated solutions. Human labelers assign each step one of three ratings: -1 (incorrect), 0 (neutral/no progress), or +1 (correct). Records also carry metadata such as labeler UUIDs, timestamps, generation phase, and quality-control flags, along with the original MATH problem, its ground-truth solution, and the model-generated solution broken down step by step.
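The record layout described above can be sketched as follows. This is a minimal illustration, not the exact schema: the field names (`labeler`, `label`, `steps`, `rating`, etc.) are assumptions for demonstration, and the real PRM800K JSONL files may name or nest fields differently.

```python
import json

# Hypothetical PRM800K-style record; field names are illustrative only.
record_line = json.dumps({
    "labeler": "00000000-0000-0000-0000-000000000000",  # labeler UUID
    "timestamp": "2022-11-01T00:00:00",
    "generation": 1,                      # data-collection phase
    "is_quality_control_question": False, # quality-control flag
    "question": {
        "problem": "What is 2 + 2?",
        "ground_truth_answer": "4",
    },
    "label": {
        "steps": [
            {"text": "2 + 2 = 4.", "rating": 1},
            {"text": "So the answer is 4.", "rating": 1},
        ]
    },
})

rec = json.loads(record_line)
ratings = [step["rating"] for step in rec["label"]["steps"]]
# One common convention in process supervision: score a solution by its
# worst step, since a single incorrect step invalidates the reasoning.
solution_score = min(ratings)
print(solution_score)
```

Because the files are newline-delimited JSON, iterating over a file line by line and calling `json.loads` on each line is enough to stream the full dataset without loading it into memory at once.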

Quick Start & Requirements

  • Install: Clone the repository using Git LFS (git lfs install && git clone <repo_url>).
  • Dependencies: Python, Git LFS.
  • Data: The data/ folder contains newline-delimited JSON data.
  • Evaluation: Run python eval/eval.py --method prm for PRM evaluation or python eval/eval.py --method orm for ORM evaluation.
  • Resources: Requires sufficient disk space for Git LFS data.

Highlighted Details

  • Contains 800,000 step-level correctness labels.
  • Includes original MATH problems, ground truth, and model-generated solutions.
  • Provides human labeler instructions for data collection phases.
  • Offers Python scripts for answer grading using SymPy for expression equality.
  • Uses a custom MATH train/test split for evaluation.
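The SymPy-based grading mentioned above can be approximated with a short sketch. This is not the repository's actual grader; the function name and fallback behavior here are assumptions, but the core idea, parsing both answers symbolically and checking that their difference simplifies to zero, matches the described approach.

```python
import sympy as sp

def answers_equal(given: str, expected: str) -> bool:
    """Compare two answer strings via symbolic equality (illustrative sketch)."""
    try:
        a = sp.sympify(given)
        b = sp.sympify(expected)
        # If the difference simplifies to zero, the expressions are equal.
        return sp.simplify(a - b) == 0
    except (sp.SympifyError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return given.strip() == expected.strip()

print(answers_equal("1/2", "0.5"))               # equivalent forms
print(answers_equal("x**2 - 1", "(x - 1)*(x + 1)"))  # factored vs expanded
```

A grader like this is deliberately conservative: `simplify` is not a complete decision procedure, so some genuinely equal expressions may fail the check, which is consistent with the caveat noted in the README.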

Maintenance & Community

The project is associated with OpenAI and the paper "Let's Verify Step by Step." Further details and community engagement can be found via the linked blog post and paper.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes that the answer grading logic is conservative and may occasionally reject correct answers or admit incorrect ones. The dataset uses a non-standard MATH train/test split, which may affect direct comparability with other benchmarks using standard splits.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
13 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 7 more.

reasoning-gym by open-thought

Top 1.2% · 1k stars
Procedural dataset generator for reasoning models
Created 7 months ago · Updated 3 days ago