s1 by simplescaling

Test-time scaling recipe for strong reasoning performance

Created 7 months ago
6,548 stars

Top 7.9% on SourcePulse

View on GitHub
Project Summary

This repository provides the artifacts for "s1: Simple test-time scaling," a method for enhancing Large Language Model (LLM) reasoning performance with minimal data. It targets researchers and practitioners seeking to improve LLM capabilities through efficient fine-tuning and inference techniques, offering strong reasoning performance with a small dataset and budget.

How It Works

The recipe has two parts. First, an LLM is fine-tuned on s1K, a small, curated dataset of 1,000 reasoning examples. Second, "budget forcing" controls test-time compute during inference: the model's "thinking" is cut off once it reaches a token budget (forcing it to produce an answer) or extended, for example by appending "Wait", to elicit further reasoning. Varying this budget trades compute against accuracy; allowing more thinking generally improves accuracy on complex tasks, which is the test-time scaling the paper measures.
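
The sketch below illustrates the budget-forcing idea with vLLM. It is not the repository's script: the model ID comes from the links in the next section, while the prompt template, the end-of-thinking delimiter, the token budget, and the "Final Answer:" suffix are illustrative placeholders; the repo's README contains the exact strings and sampling settings.

```python
# Minimal budget-forcing sketch (illustrative only; see the repo for the exact script).
from vllm import LLM, SamplingParams

MODEL = "simplescaling/s1.1-32B"     # model released with the paper
THINK_END = "<|im_start|>answer"     # placeholder end-of-thinking delimiter

llm = LLM(model=MODEL, tensor_parallel_size=8)  # a 32B model needs several GPUs

# Placeholder prompt template; the README defines the real one.
prompt = "<|im_start|>user\nHow many r's are in raspberry?<|im_end|>\n<|im_start|>think\n"

# Phase 1: let the model "think", but cap its thinking at `budget` tokens.
budget = 512
think = llm.generate(
    prompt,
    SamplingParams(max_tokens=budget, stop=[THINK_END], temperature=0.0),
)[0].outputs[0].text

# Phase 2: append the answer delimiter so the model must produce a final answer.
answer = llm.generate(
    prompt + think + THINK_END + "\nFinal Answer:",
    SamplingParams(max_tokens=256, temperature=0.0),
)[0].outputs[0].text
print(answer)
```

The same mechanism can scale compute up rather than down: when the model tries to stop thinking early, the paper appends "Wait" instead of the answer delimiter, prompting it to keep reasoning.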

Quick Start & Requirements

  • Inference: Use the vLLM or transformers libraries (a hedged sketch follows this list).
    • vLLM example: pip install vllm transformers, then run the Python snippet from the README.
    • transformers example: pip install transformers torch, then run the Python snippet from the README.
  • Training: Requires 16 H100 GPUs (2 nodes x 8 GPUs). Clone the repo, install the requirements (pip3 install -r requirements.txt), and run bash train/sft.sh. Gradient checkpointing can be enabled to work around out-of-memory (OOM) errors.
  • Evaluation: Uses a modified lm-evaluation-harness included under eval/; install it with cd eval/lm-evaluation-harness && pip install -e .[math,vllm].
  • Data: Requires Python scripts for data collection, trace generation (Gemini), inference (Qwen), featurization, and filtering.
  • Resources: Training requires significant GPU resources. Inference setup is straightforward with standard libraries.
  • Links: Paper: https://arxiv.org/abs/2501.19393, Models: https://hf.co/simplescaling/s1.1-32B, Data: https://hf.co/datasets/simplescaling/s1K-1.1
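
For the transformers route listed above, the following is a minimal, hedged sketch. The model ID is taken from the links above; the prompt, the chat-template usage, and the max_new_tokens value are assumptions for illustration, and the README provides the exact snippet.

```python
# Minimal transformers inference sketch (illustrative; the README has the exact snippet).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "simplescaling/s1.1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"  # a 32B model needs substantial GPU memory
)

messages = [{"role": "user", "content": "How many r's are in raspberry?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens bounds the total generation; the paper's budget forcing
# additionally controls how many of those tokens are spent on "thinking".
output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```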

Highlighted Details

  • Achieves strong reasoning performance with only 1,000 examples.
  • Introduces "budget forcing" for constrained test-time inference.
  • Offers the s1.1-32B model and the s1K-1.1 dataset, which uses reasoning traces from DeepSeek R1.
  • Evaluation scripts are based on a modified lm-evaluation-harness.

Maintenance & Community

  • Project led by Niklas Muennighoff.
  • Updates include release of paper, models, and datasets.
  • Links to TWIML Podcast and Microsoft GenAI Talk are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The models and datasets are hosted on Hugging Face; consult their model and dataset cards for the applicable license terms.

Limitations & Caveats

  • A known vLLM issue can raise a ValueError during budget forcing with specific token IDs; the suggested workaround is to uncomment a line in vllm/engine/llm_engine.py.
  • Training requires substantial GPU resources (16 H100s recommended).
  • The data generation scripts require some manual renaming (e.g., of organization names or paths) before they can be run.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA

0.8%
1k
Evaluation suite for long-context language models (research paper)
Created 1 year ago
Updated 1 month ago
Starred by Vincent Weisser (Cofounder of Prime Intellect), Ross Taylor (Cofounder of General Reasoning; Cocreator of Papers with Code), and 11 more.

open-instruct by allenai

0.7%
3k
Training codebase for instruction-following language models
Created 2 years ago
Updated 15 hours ago