s1 by simplescaling

Test-time scaling recipe for strong reasoning performance

Created 7 months ago
6,548 stars

Top 7.9% on SourcePulse

View on GitHub
Project Summary

This repository provides the artifacts for "s1: Simple test-time scaling," a method for enhancing Large Language Model (LLM) reasoning performance with minimal data. It targets researchers and practitioners seeking to improve LLM capabilities through efficient fine-tuning and inference techniques, offering strong reasoning performance with a small dataset and budget.

How It Works

The recipe has two parts. First, an LLM is fine-tuned on s1K, a small, curated dataset of 1,000 reasoning examples. Second, "budget forcing" controls test-time compute during inference: the model's "thinking" is cut off once it reaches a token budget (forcing it to produce an answer) or extended, for example by appending "Wait", to elicit further reasoning. Varying this budget trades compute against accuracy; allowing more thinking generally improves accuracy on complex tasks, which is the test-time scaling the paper measures.
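
The sketch below illustrates the budget-forcing idea with vLLM. It is not the repository's script: the model ID comes from the links in the next section, while the prompt template, the end-of-thinking delimiter, the token budget, and the "Final Answer:" suffix are illustrative placeholders; the repo's README contains the exact strings and sampling settings.

```python
# Minimal budget-forcing sketch (illustrative only; see the repo for the exact script).
from vllm import LLM, SamplingParams

MODEL = "simplescaling/s1.1-32B"     # model released with the paper
THINK_END = "<|im_start|>answer"     # placeholder end-of-thinking delimiter

llm = LLM(model=MODEL, tensor_parallel_size=8)  # a 32B model needs several GPUs

# Placeholder prompt template; the README defines the real one.
prompt = "<|im_start|>user\nHow many r's are in raspberry?<|im_end|>\n<|im_start|>think\n"

# Phase 1: let the model "think", but cap its thinking at `budget` tokens.
budget = 512
think = llm.generate(
    prompt,
    SamplingParams(max_tokens=budget, stop=[THINK_END], temperature=0.0),
)[0].outputs[0].text

# Phase 2: append the answer delimiter so the model must produce a final answer.
answer = llm.generate(
    prompt + think + THINK_END + "\nFinal Answer:",
    SamplingParams(max_tokens=256, temperature=0.0),
)[0].outputs[0].text
print(answer)
```

The same mechanism can scale compute up rather than down: when the model tries to stop thinking early, the paper appends "Wait" instead of the answer delimiter, prompting it to keep reasoning.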

Quick Start & Requirements

  • Inference: Use the vLLM or transformers libraries (a hedged sketch follows this list).
    • vLLM example: pip install vllm transformers, then run the Python snippet from the README.
    • transformers example: pip install transformers torch, then run the Python snippet from the README.
  • Training: Requires 16 H100 GPUs (2 nodes x 8 GPUs). Clone the repo, install the requirements (pip3 install -r requirements.txt), and run bash train/sft.sh. Gradient checkpointing can be enabled to work around out-of-memory (OOM) errors.
  • Evaluation: Uses a modified lm-evaluation-harness included under eval/; install it with cd eval/lm-evaluation-harness && pip install -e .[math,vllm].
  • Data: Requires Python scripts for data collection, trace generation (Gemini), inference (Qwen), featurization, and filtering.
  • Resources: Training requires significant GPU resources. Inference setup is straightforward with standard libraries.
  • Links: Paper: https://arxiv.org/abs/2501.19393, Models: https://hf.co/simplescaling/s1.1-32B, Data: https://hf.co/datasets/simplescaling/s1K-1.1
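
For the transformers route listed above, the following is a minimal, hedged sketch. The model ID is taken from the links above; the prompt, the chat-template usage, and the max_new_tokens value are assumptions for illustration, and the README provides the exact snippet.

```python
# Minimal transformers inference sketch (illustrative; the README has the exact snippet).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "simplescaling/s1.1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"  # a 32B model needs substantial GPU memory
)

messages = [{"role": "user", "content": "How many r's are in raspberry?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens bounds the total generation; the paper's budget forcing
# additionally controls how many of those tokens are spent on "thinking".
output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```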

Highlighted Details

  • Achieves strong reasoning performance with only 1,000 examples.
  • Introduces "budget forcing" for constrained test-time inference.
  • Offers the s1.1-32B model and the s1K-1.1 dataset, which uses reasoning traces from DeepSeek R1.
  • Evaluation scripts are based on a modified lm-evaluation-harness.

Maintenance & Community

  • Project led by Niklas Muennighoff.
  • Updates include release of paper, models, and datasets.
  • Links to TWIML Podcast and Microsoft GenAI Talk are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The models and datasets are hosted on Hugging Face; consult their model and dataset cards for the applicable license terms.

Limitations & Caveats

  • A known vLLM issue can raise a ValueError during budget forcing with specific token IDs; the suggested workaround is to uncomment a line in vllm/engine/llm_engine.py.
  • Training requires substantial GPU resources (16 H100s recommended).
  • The data generation scripts require some manual renaming (e.g., of organization names or paths) before they can be run.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA

0.8%
1k
Evaluation suite for long-context language models (research paper)
Created 1 year ago
Updated 1 month ago
Starred by Vincent Weisser (Cofounder of Prime Intellect), Ross Taylor (Cofounder of General Reasoning; Cocreator of Papers with Code), and 11 more.

open-instruct by allenai

0.7%
3k
Training codebase for instruction-following language models
Created 2 years ago
Updated 15 hours ago