Test-time scaling recipe for strong reasoning performance
Top 8.0% on sourcepulse
This repository provides the artifacts for "s1: Simple test-time scaling," a method for enhancing Large Language Model (LLM) reasoning performance with minimal data. It targets researchers and practitioners seeking to improve LLM capabilities through efficient fine-tuning and inference techniques, offering strong reasoning performance with a small dataset and budget.
How It Works
The core innovation lies in "test-time scaling," a technique that involves fine-tuning an LLM on a small, curated dataset (s1K) of reasoning examples. This approach leverages budget forcing during inference: the model's "thinking" phase is either capped at a token limit or extended by appending a continuation cue (such as "Wait") when it tries to stop too early. This encourages more focused and efficient reasoning, leading to improved accuracy on complex tasks.
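The control logic of budget forcing can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's actual API: the whitespace-token representation, the `</think>` delimiter, and the `"Wait"` cue are assumptions made for the sake of the sketch.

```python
# Minimal sketch of budget forcing (illustrative; not the repo's actual API).
# Assumptions: a thinking trace is a list of tokens, "</think>" marks the end
# of the thinking phase, and "Wait" is the cue used to extend reasoning.

def budget_force(thinking_tokens, max_thinking_tokens, min_thinking_tokens=0):
    """Cap or extend a model's thinking trace.

    - If the trace is over budget, truncate it and force the end-of-thinking
      delimiter so the model moves on to its final answer.
    - If the trace ended before a minimum budget was spent, strip the
      delimiter and append "Wait" to encourage further reasoning.
    """
    tokens = list(thinking_tokens)
    if len(tokens) > max_thinking_tokens:
        # Over budget: cut the trace and close the thinking phase.
        tokens = tokens[:max_thinking_tokens] + ["</think>"]
    elif tokens and tokens[-1] == "</think>" and len(tokens) < min_thinking_tokens:
        # Stopped too early: reopen the thinking phase with a cue.
        tokens = tokens[:-1] + ["Wait"]
    return tokens


if __name__ == "__main__":
    long_trace = ["step"] * 10
    print(budget_force(long_trace, max_thinking_tokens=4))
    short_trace = ["hmm", "</think>"]
    print(budget_force(short_trace, max_thinking_tokens=100, min_thinking_tokens=8))
```

In a real decoding loop the same decision would be applied to token IDs produced by the model, with the forced delimiter fed back into generation.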
Quick Start & Requirements
Inference requires the `vLLM` or `transformers` libraries.
- `vLLM` example: `pip install vllm transformers`, then run the provided Python code.
- `transformers` example: `pip install transformers torch`, then run the provided Python code.

For training, install the dependencies (`pip3 install -r requirements.txt`) and run `bash train/sft.sh`. Gradient checkpointing can be enabled to address OOM issues.

Evaluation uses `lm-evaluation-harness` (`cd eval/lm-evaluation-harness && pip install -e .[math,vllm]`).

Highlighted Details
- The `s1.1-32B` model and `s1K-1.1` dataset use reasoning traces from `r1`.
- Evaluation is performed with `lm-evaluation-harness`.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
`vLLM` can raise a `ValueError` during budget forcing with specific token IDs; a suggested workaround is to uncomment a line in `vllm/engine/llm_engine.py`.