JustRL by thunlp

Scaling LLMs with a simple RL recipe

Created 8 months ago

284 stars

Top 91.8% on SourcePulse

Project Summary

JustRL is a streamlined recipe for scaling large language models (LLMs) with reinforcement learning (RL), targeting 1.5B-parameter models. It uses a single training stage with fixed hyperparameters and reports state-of-the-art mathematical reasoning performance for models of that size. In contrast to complex multi-stage pipelines, it reaches competitive results with about half the compute and more stable training, which makes it useful for researchers and practitioners who want efficient RL fine-tuning of LLMs.

How It Works

JustRL's core idea is deliberate simplicity: a single training stage using standard GRPO, with binary outcome rewards from a basic string-matching DAPO verifier. It avoids multi-stage pipelines, dynamic schedules, and per-model hyperparameter tuning, and relies on one fixed set of hyperparameters instead. The authors report that this minimal recipe improves performance steadily and monotonically over long training runs, without oscillations or collapses, and matches or exceeds more complex methods with substantially less compute.
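The reward and advantage computation described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper names (extract_boxed, binary_reward, group_advantages) are hypothetical, the answer is assumed to appear in a \boxed{...} span, and the whitespace-stripping comparison is a simplification of whatever normalization the actual verifier applies. The advantage is the usual GRPO group normalization, with the population standard deviation assumed.

```python
import math


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} span, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def binary_reward(response, gold):
    """1.0 if the boxed answer string-matches the reference, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    # Assumption: ignore spaces only; no symbolic equivalence checking.
    return 1.0 if pred.replace(" ", "") == gold.replace(" ", "") else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's samples."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt whose reference answer is "\frac{1}{2}".
responses = [
    "So the answer is \\boxed{\\frac{1}{2}}.",
    "We get \\boxed{2}.",
    "No final answer given.",
    "Thus \\boxed{ \\frac{1}{2} }",
]
rewards = [binary_reward(r, "\\frac{1}{2}") for r in responses]
advantages = group_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage. When every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.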

Quick Start & Requirements

Installation: A conda environment is recommended: run conda create -n justrl python=3.10, then conda activate justrl.
Key Dependencies: PyTorch (2.6.0), vLLM (0.8.4), transformers (4.51.3), sympy (1.13.1), pylatexenc (2.10).
Data: Requires downloading large evaluation output files from a provided Google Drive link and extracting them to the repository root.
Links: The project is associated with an ICLR 2026 Blogpost Track submission and provides a citation for the paper "JustRL: Scaling a 1.5B LLM with a Simple RL Recipe".
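Because the dependency versions are pinned, it can help to confirm that the active environment matches them before running the evaluation scripts. The following is a small sketch using only the standard library. The function name find_mismatches is hypothetical, and the distribution names (for example torch for PyTorch) are assumed to be the usual PyPI names rather than taken from the repository.

```python
from importlib import metadata

# Versions listed under Key Dependencies; keys are assumed PyPI distribution names.
PINNED = {
    "torch": "2.6.0",
    "vllm": "0.8.4",
    "transformers": "4.51.3",
    "sympy": "1.13.1",
    "pylatexenc": "2.10",
}


def find_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {name: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "+cu124" are ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches().items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script inside the justrl environment prints one line per package that is missing or at a different version, and prints nothing when everything matches.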

Highlighted Details

Achieves state-of-the-art performance on mathematical reasoning benchmarks for 1.5B LLMs.
Delivers comparable or better results with about half the compute of sophisticated, multi-stage RL approaches.
Demonstrates robustness and reproducibility by applying identical, fixed hyperparameters across different 1.5B base models (DeepSeek and Nemotron).
Provides complete evaluation scripts and released model weights for JustRL-DeepSeek-1.5B and JustRL-Nemotron-1.5B.

Maintenance & Community

Information regarding project maintainers, community channels (e.g., Discord, Slack), or specific development roadmaps is not detailed in the provided README excerpt.

Licensing & Compatibility

The README excerpt does not specify the software license. Consequently, compatibility for commercial use or linking with closed-source projects cannot be determined without further information.

Limitations & Caveats

The repository focuses on evaluation scripts and released models and gives little explicit detail on setting up the full training pipeline. The lack of a specified license is a potential blocker for commercial adoption. Hardware requirements are not documented.
