s1  by simplescaling

Test-time scaling recipe for strong reasoning performance

created 6 months ago
6,516 stars

Top 8.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the artifacts for "s1: Simple test-time scaling," a method for enhancing Large Language Model (LLM) reasoning performance with minimal data. It targets researchers and practitioners seeking to improve LLM capabilities through efficient fine-tuning and inference techniques, delivering strong reasoning performance from a small training set and a modest compute budget.

How It Works

The core innovation is "test-time scaling": the model is first fine-tuned on a small, curated dataset of reasoning examples (s1K), then its test-time compute is controlled at inference via "budget forcing." To cap compute, the model's "thinking" phase is cut short by appending an end-of-thinking delimiter so the model moves on to its answer; to extend compute, that delimiter is suppressed and "Wait" is appended, nudging the model to keep reasoning. Scaling the thinking budget in this way improves accuracy on complex tasks.
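The extension half of budget forcing can be sketched with ordinary transformers generation. The snippet below is a simplified illustration, not the repo's script: the thinking budget, the number of "Wait" continuations, and the raw-text prompt are all assumptions (the real scripts use the model's chat template and end-of-thinking delimiters).

    # Budget-forcing sketch with Hugging Face transformers (illustrative only).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "simplescaling/s1.1-32B"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, device_map="auto", torch_dtype="auto"  # device_map needs `accelerate`
    )

    THINK_BUDGET = 512   # assumed cap on thinking tokens per round
    NUM_EXTENSIONS = 2   # assumed number of forced "Wait" continuations

    text = "How many r's are in 'raspberry'?"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Cap the thinking phase at THINK_BUDGET newly generated tokens.
    out = model.generate(**inputs, max_new_tokens=THINK_BUDGET)

    # Extend thinking: append "Wait" and let the model keep reasoning.
    for _ in range(NUM_EXTENSIONS):
        text = tokenizer.decode(out[0], skip_special_tokens=True) + " Wait"
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=THINK_BUDGET)

    print(tokenizer.decode(out[0], skip_special_tokens=True))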

Quick Start & Requirements

  • Inference: Use the vLLM or transformers libraries; a minimal vLLM call is sketched after this list.
    • vLLM: pip install vllm transformers, then run the Python example from the README.
    • transformers: pip install transformers torch, then run the Python example from the README.
  • Training: Requires 16 H100 GPUs (2 nodes x 8 GPUs). Clone the repo, install requirements (pip3 install -r requirements.txt), and run bash train/sft.sh; enable gradient checkpointing if you hit OOM errors.
  • Evaluation: Install the modified lm-evaluation-harness bundled with the repo (cd eval/lm-evaluation-harness && pip install -e .[math,vllm]).
  • Data: The repo includes Python scripts for data collection, reasoning-trace generation (Gemini), inference (Qwen), featurization, and filtering.
  • Resources: Training requires significant GPU resources. Inference setup is straightforward with standard libraries.
  • Links: Paper: https://arxiv.org/abs/2501.19393, Models: https://hf.co/simplescaling/s1.1-32B, Data: https://hf.co/datasets/simplescaling/s1K-1.1
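A minimal vLLM call, sketched under illustrative assumptions (the tensor-parallel degree and sampling settings are placeholders, not the README's exact script):

    # Minimal vLLM inference sketch (illustrative settings, not the repo's script).
    from vllm import LLM, SamplingParams

    llm = LLM(model="simplescaling/s1.1-32B", tensor_parallel_size=2)  # assumed GPU split
    params = SamplingParams(max_tokens=32768, temperature=0.0)

    outputs = llm.generate(["How many r's are in 'raspberry'?"], params)
    print(outputs[0].outputs[0].text)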

Highlighted Details

  • Achieves strong reasoning performance with only 1,000 examples.
  • Introduces "budget forcing" for constrained test-time inference.
  • Offers the s1.1-32B model and the s1K-1.1 dataset, built with reasoning traces from DeepSeek R1.
  • Evaluation scripts are based on a modified lm-evaluation-harness.

Maintenance & Community

  • Project led by Niklas Muennighoff.
  • Updates include release of paper, models, and datasets.
  • Links to TWIML Podcast and Microsoft GenAI Talk are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The models and datasets are hosted on Hugging Face; check the individual model and dataset cards for their licensing terms, as Hugging Face does not apply a default license.

Limitations & Caveats

  • A known vLLM issue can raise a ValueError during budget forcing with specific token IDs; the suggested workaround is to uncomment a line in vllm/engine/llm_engine.py.
  • Training requires substantial GPU resources (16 H100s recommended).
  • Data generation scripts require manually renaming the organization referenced in the scripts before use.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

  • 228 stars in the last 90 days
