Research paper code for compute-optimal test-time scaling of LLMs
This repository provides the official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling." It enables researchers and practitioners to explore and implement test-time scaling strategies for Large Language Models (LLMs) in mathematical reasoning tasks, aiming to improve performance without retraining.
How It Works
The project implements several test-time scaling (TTS) methods, including Chain-of-Thought (CoT), Best-of-N (BoN), Beam Search, and Diverse Verifier Tree Search (DVTS). These methods combine policy models (LLMs) with process reward models (PRMs) to enhance reasoning. The core idea is to scale computation at inference time: generate multiple candidate solutions and select the best one, optimizing performance for a given compute budget.
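For intuition, here is a minimal Best-of-N sketch: the policy model samples N candidate solutions and the highest-scoring one is returned. The model name and the `score_with_prm` helper are illustrative placeholders standing in for the repository's policy/PRM interfaces, not the actual code.

```python
# Minimal Best-of-N (BoN) sketch, assuming a Hugging Face policy model.
# The model name and score_with_prm() are placeholders, not this repo's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_NAME = "Qwen/Qwen2.5-Math-1.5B-Instruct"  # placeholder policy model

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(
    POLICY_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def score_with_prm(question: str, solution: str) -> float:
    """Stand-in for a process reward model (PRM).

    A real PRM scores each reasoning step and aggregates the rewards;
    here a dummy value keeps the Best-of-N control flow visible."""
    return float(len(solution))  # dummy heuristic, NOT a real reward

def best_of_n(question: str, n: int = 8, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(question, return_tensors="pt").to(policy.device)
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        num_return_sequences=n,  # draw N independent samples
        max_new_tokens=max_new_tokens,
    )
    # Strip the prompt tokens, keep only the generated continuations.
    completions = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Return the candidate the (placeholder) PRM scores highest.
    return max(completions, key=lambda c: score_with_prm(question, c))

print(best_of_n("What is 12 * 13? Reason step by step."))
```

Beam Search and DVTS follow the same pattern but apply the PRM to intermediate steps, pruning or diversifying partial solutions rather than scoring only completed ones.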
Quick Start & Requirements
Create a conda environment (`conda create -n tts python=3.10`), activate it, and install the dependencies (`pip install -r requirements.txt`, plus `flash-attn`, `ray[default]==2.38.0`, `fschat[model_worker,webui]`, `sympy==1.12`, and the `latex2sympy` package). Key requirements include `flash-attn`, `tmux`, and the pinned versions of `ray` and `fschat`. GPU configurations range from 1x A100 80GB to 4x A100 80GB, depending on model size.
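As a quick sanity check before launching experiments, the snippet below verifies the pinned package versions and counts visible GPUs. It is an illustrative helper, not part of the repository; the package names follow the install list above and may need adjusting to your setup.

```python
# Illustrative environment check for the pinned dependencies listed above.
# Not part of the repository; distribution names may differ from import names.
from importlib.metadata import PackageNotFoundError, version

import torch

PINNED = {"ray": "2.38.0", "sympy": "1.12"}
EXTRAS = ["fschat", "flash-attn", "latex2sympy"]

for pkg, want in PINNED.items():
    try:
        got = version(pkg)
        print(f"{pkg}: {got}" + ("" if got == want else f" (expected {want})"))
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED (expected {want})")

for pkg in EXTRAS:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")

# The README's GPU guidance ranges from 1x to 4x A100 80GB by model size.
print(f"visible CUDA devices: {torch.cuda.device_count()}")
```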
Maintenance & Community
The project is associated with authors from multiple institutions and has received media coverage from QbitAI and AI Era. It is actively maintained, with code released in February 2025.
Licensing & Compatibility
The repository is released under the Apache-2.0 license, permitting commercial use and linking with closed-source projects.
Limitations & Caveats
The mathematical expression evaluation is based on Qwen2.5-Math; for more advanced evaluation, users are directed to the Math-Verify repository. The README notes that for BoN and DVTS, average results are not computed by default and require post-processing.
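For a rough sense of what symbolic answer checking involves, the sketch below compares a predicted and a gold LaTeX answer with sympy. This is not the repository's Qwen2.5-Math-based evaluator (nor Math-Verify); `parse_latex` additionally requires the `antlr4-python3-runtime` package.

```python
# Illustrative-only equivalence check between predicted and gold LaTeX answers.
# Not the repository's evaluator; it only shows the kind of symbolic comparison
# such evaluators perform.
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def answers_match(pred_latex: str, gold_latex: str) -> bool:
    """True if the two LaTeX expressions simplify to the same value."""
    try:
        diff = simplify(parse_latex(pred_latex) - parse_latex(gold_latex))
        return diff == 0
    except Exception:
        # Fall back to exact string comparison if parsing fails.
        return pred_latex.strip() == gold_latex.strip()

print(answers_match(r"\frac{1}{2}", "0.5"))  # True
```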