prm800k by openai

Dataset of LLM solutions to math problems with step-level correctness labels

created 2 years ago
2,032 stars

Top 22.3% on sourcepulse

View on GitHub
Project Summary

PRM800K provides 800,000 step-level correctness labels for LLM-generated solutions to MATH dataset problems. This dataset is designed for researchers and engineers developing and evaluating process-based supervision methods for improving LLM reasoning capabilities, particularly in complex mathematical problem-solving.

How It Works

The dataset consists of detailed JSON annotations for each step within LLM-generated solutions. Each step is rated by human labelers as -1 (incorrect), 0 (neutral/no progress), or +1 (correct). The data includes metadata such as labeler UUIDs, timestamps, generation phase, and quality control flags. It also contains the original MATH problem, ground truth solution, and the model-generated solution with its step-by-step breakdown.
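
As a concrete illustration, the sketch below reads one annotation record in Python. The field names used (question.problem, question.ground_truth_answer, label.steps, completions[].rating) are inferred from the description above rather than a documented schema, so verify them against a sample line from the data/ folder before relying on them.

    import json

    # Minimal sketch of reading one annotation record. Field names are
    # assumptions based on the description above -- check a real line
    # from the data/ folder before relying on them.
    def summarize_record(line):
        record = json.loads(line)
        question = record.get("question") or {}
        steps = (record.get("label") or {}).get("steps") or []
        ratings = [
            completion.get("rating")          # expected to be -1, 0, or +1
            for step in steps
            for completion in (step.get("completions") or [])
        ]
        return {
            "problem": question.get("problem"),
            "ground_truth_answer": question.get("ground_truth_answer"),
            "num_steps": len(steps),
            "step_ratings": ratings,
        }

    with open("data/phase2_train.jsonl") as f:  # file name is illustrative
        print(summarize_record(next(f)))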

Quick Start & Requirements

  • Install: Clone the repository using Git LFS (git lfs install && git clone <repo_url>).
  • Dependencies: Python, Git LFS.
  • Data: The data/ folder contains newline-delimited JSON data (a streaming sketch appears after this list).
  • Evaluation: Run python eval/eval.py --method prm for PRM evaluation or python eval/eval.py --method orm for ORM evaluation.
  • Resources: Requires sufficient disk space for Git LFS data.
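
As referenced in the Data bullet above, the sketch below streams the newline-delimited JSON files and tallies step ratings without loading whole files into memory. The directory layout, file glob, and field names are assumptions based on this summary, not a documented interface.

    import json
    from collections import Counter
    from pathlib import Path

    # Sketch: stream every .jsonl file under data/ and tally the
    # -1/0/+1 step ratings. Paths and field names are assumptions.
    def rating_counts(data_dir="data"):
        counts = Counter()
        for path in sorted(Path(data_dir).glob("**/*.jsonl")):
            with path.open() as f:
                for line in f:
                    record = json.loads(line)
                    for step in (record.get("label") or {}).get("steps") or []:
                        for completion in step.get("completions") or []:
                            counts[completion.get("rating")] += 1
        return counts

    print(rating_counts())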

Highlighted Details

  • Contains 800,000 step-level correctness labels.
  • Includes original MATH problems, ground truth, and model-generated solutions.
  • Provides human labeler instructions for data collection phases.
  • Offers Python scripts for answer grading using SymPy for expression equality (a minimal sketch follows this list).
  • Uses a custom MATH train/test split for evaluation.
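
The SymPy-based expression-equality check mentioned above can be approximated as follows. grade_answer here is an illustrative sketch, not the repository's actual grading function; the repo's own scripts do more normalization and, per the caveat below, are deliberately conservative.

    import sympy

    # Hedged sketch of SymPy-based answer grading: treat two answers as
    # equal if their symbolic difference simplifies to zero, falling back
    # to string comparison when parsing fails.
    def grade_answer(given, ground_truth):
        try:
            diff = sympy.simplify(sympy.sympify(given) - sympy.sympify(ground_truth))
            return bool(diff == 0)
        except (sympy.SympifyError, TypeError):
            return given.strip() == ground_truth.strip()

    print(grade_answer("2*(x + 1)", "2*x + 2"))  # True
    print(grade_answer("1/2", "0.5"))            # True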

Maintenance & Community

The project is associated with OpenAI and the paper "Let's Verify Step by Step." Further details and community engagement can be found via the linked blog post and paper.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes that the answer grading logic is conservative and may occasionally reject correct answers or admit incorrect ones. The dataset uses a non-standard MATH train/test split, which may affect direct comparability with other benchmarks using standard splits.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 57 stars in the last 90 days
