oat-zero by sail-sg

Pilot study analyzing LLM training dynamics and emergent behaviors

Created 1 year ago

250 stars

Top 100.0% on SourcePulse

Project Summary

This project investigates the purported "Aha moment" and emergent self-reflection in DeepSeek-R1-Zero-like LLM training, offering a critical analysis for researchers and practitioners working on reinforcement learning for large language models. It provides a reproduction framework and detailed findings that challenge existing interpretations of LLM training dynamics and of what self-reflection in model outputs actually indicates.

How It Works

The project scrutinizes R1-Zero-like training methodologies, positing that the observed "Aha moment" and self-reflection patterns, specifically "Superficial Self-Reflection" (SSR), are present in base models rather than emerging solely through reinforcement learning (RL). It argues that the increase in response length often attributed to emergent skills is, in fact, a consequence of RL optimizing well-designed, rule-based reward functions. The core reproduction leverages the oat framework for efficient replication of R1-Zero-like training on tasks like Countdown, and SimpleRL for MATH task reproduction, with vLLM utilized for accelerated inference.
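To make the phrase "rule-based reward function" concrete, here is a minimal sketch of what such a reward could look like for a Countdown-style task (combine the given numbers with arithmetic to reach a target). It is an illustration only, not the reward used in the repository: the <answer>...</answer> tag format, the binary 1.0/0.0 scoring, and the function name countdown_reward are all assumptions.

```python
import ast
import operator
import re
from collections import Counter

# Only the four basic arithmetic operators are accepted (assumption).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a parsed expression made of integers and + - * / only."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(response, numbers, target):
    """Hypothetical rule-based reward: 1.0 if the tagged answer is valid, else 0.0.

    The <answer> tag convention is an assumption made for this sketch.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    # Rule 1: every given number is used exactly once.
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if Counter(used) != Counter(numbers):
        return 0.0
    # Rule 2: the expression parses and evaluates to the target.
    try:
        value = _eval(ast.parse(expr, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Nothing in this sketch scores response length or the presence of reflective phrases directly; the reward depends only on the final tagged answer, so any change in length during training would be a side effect of optimizing these rules.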

Quick Start & Requirements

Installation: pip install vllm==0.6.2 && pip install oat-llm. For an editable install, clone the repository and install from source (git clone https://github.com/sail-sg/oat.git, cd oat, pip install -e .).
Prerequisites: vLLM==0.6.2. The README does not state hardware requirements (e.g., GPU, CUDA), though running and training LLMs with vLLM implies GPU hardware.
Links: Associated blog post: https://oatllm.notion.site/oat-zero
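Because the install step pins an exact vLLM version, a quick check of the environment can save a failed run. The sketch below uses only the standard library's importlib.metadata; the helper name check_pins and the choice to ignore a local build suffix such as "+cu121" are assumptions, not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin taken from the install command above.
REQUIRED = {"vllm": "0.6.2"}


def check_pins(required=REQUIRED, get_version=version):
    """Return {package: found_version_or_None} for every pin that is not met."""
    problems = {}
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Ignore a local build suffix like "+cu121" when comparing (assumption).
        if found is None or found.split("+")[0] != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    mismatches = check_pins()
    if mismatches:
        print("Version pins not satisfied:", mismatches)
    else:
        print("All pinned packages match.")
```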

Highlighted Details

Challenges the notion of "Aha moments" and emergent self-reflection in R1-Zero-like training, finding such behaviors present in base models.
Identifies "Superficial Self-Reflection" (SSR) where self-reflection does not guarantee correct outputs.
Attributes increased response length during RL optimization to reward function design, not emergent self-reflection.
Provides scripts for reproducing results on the Countdown task (training/run_grpo.sh) and references instructions for MATH task reproduction (simpleRL/train).
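The SSR claim, that self-reflection does not guarantee a correct output, can be illustrated by cross-tabulating reflection against correctness over a set of sampled responses. The sketch below is a rough proxy under stated assumptions: the keyword list, the function names, and keyword matching itself are illustrative choices, not necessarily how the study detects self-reflection.

```python
import re
from collections import Counter

# Hypothetical phrases treated as markers of self-reflection (assumption).
REFLECTION_PATTERNS = (
    "wait",
    "recheck",
    "re-check",
    "let me check",
    "try again",
    "rethink",
)


def has_self_reflection(response, patterns=REFLECTION_PATTERNS):
    """Return True if any marker phrase occurs as a whole word or phrase."""
    text = response.lower()
    return any(
        re.search(r"\b" + re.escape(p) + r"\b", text) for p in patterns
    )


def tabulate(samples):
    """Count (reflects, correct) pairs over an iterable of (response, is_correct)."""
    counts = Counter()
    for response, is_correct in samples:
        counts[(has_self_reflection(response), bool(is_correct))] += 1
    return counts


if __name__ == "__main__":
    demo = [
        ("Wait, let me recheck. 2 + 2 = 5", False),
        ("2 + 2 = 4", True),
    ]
    table = tabulate(demo)
    # The (True, False) cell holds responses that reflect yet are still wrong.
    print(table[(True, False)])
```

Running the same tabulation on base-model samples and on RL-trained samples would show whether reflective phrasing is already present before training, which is the comparison the findings above rest on.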

Maintenance & Community

The README lists no community channels (e.g., Discord, Slack) and gives no details on ongoing maintenance or on contributors beyond the authors of the associated blog post.

Licensing & Compatibility

The project is distributed under the MIT license, which is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

This work is presented as a "Pilot Study," focusing on analysis and reproduction of specific training phenomena rather than offering a production-ready system. The findings suggest that previous interpretations of R1-Zero-like training outcomes may require revision, particularly regarding the emergence of self-reflection.
