oat-zero by sail-sg

Pilot study analyzing LLM training dynamics and emergent behaviors

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Project Summary

This project investigates the purported "Aha moment" and emergent self-reflection in DeepSeek-R1-Zero-like LLM training, offering a critical analysis for researchers and practitioners in reinforcement learning for large language models. It provides a reproduction framework and detailed findings that challenge existing interpretations, potentially refining understanding of LLM training dynamics and the nature of self-reflection.

How It Works

The project scrutinizes R1-Zero-like training methodologies, positing that the observed "Aha moment" and self-reflection patterns, specifically "Superficial Self-Reflection" (SSR), are present in base models rather than emerging solely through reinforcement learning (RL). It argues that the increase in response length often attributed to emergent skills is, in fact, a consequence of RL optimizing well-designed, rule-based reward functions. The core reproduction leverages the oat framework for efficient replication of R1-Zero-like training on tasks like Countdown, and SimpleRL for MATH task reproduction, with vLLM utilized for accelerated inference.
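To make the idea of a rule-based reward concrete, here is a minimal sketch for a Countdown-style task, in which a response must combine a given set of numbers with basic arithmetic to reach a target. It is not the project's reward function: the `<answer>` tag format, the name `countdown_reward`, and the reward values (0.0, 0.1, 1.0) are all assumptions made for illustration.

```python
import ast
import operator
import re
from collections import Counter

# Hypothetical conventions: the answer tag and the reward values are assumptions.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_REWARD = 0.1   # parseable answer that is wrong
CORRECT_REWARD = 1.0  # uses each number once and hits the target

OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate an AST restricted to numbers and the binary operators + - * /."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(response, numbers, target):
    """Score a response with fixed rules only; no learned reward model is involved."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no answer tag at all
    expr = match.group(1).strip()
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if Counter(used) != Counter(numbers):
        return FORMAT_REWARD  # wrong set of numbers
    try:
        value = _evaluate(ast.parse(expr, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return FORMAT_REWARD
    return CORRECT_REWARD if abs(value - target) < 1e-6 else FORMAT_REWARD
```

Under these assumptions, `countdown_reward("<answer>(6 + 2) * 3</answer>", [2, 3, 6], 24)` scores the full reward. Because the score depends only on the final tagged expression, the text before the tag can grow or shrink freely during RL, which is one way a reward's design can shape response length.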

Quick Start & Requirements

  • Installation: Install via pip: pip install vllm==0.6.2 && pip install oat-llm. For an editable install of the oat framework, clone its repository instead (git clone https://github.com/sail-sg/oat.git, cd oat, pip install -e .).
  • Prerequisites: vLLM==0.6.2. The README does not state hardware requirements (e.g., GPU, CUDA), though they are implied for LLM training and inference.
  • Links: Associated blog post: https://oatllm.notion.site/oat-zero
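Since the setup pins an exact vLLM version, a quick check after installation can catch a mismatched environment early. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `pin_mismatches` are hypothetical, and comparing version strings exactly is a simplification (a build with a local suffix such as `+cu118` would be reported as a mismatch).

```python
from importlib.metadata import PackageNotFoundError, version

# The only pin stated in the Quick Start; no version is given for oat-llm.
REQUIRED = {"vllm": "0.6.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def pin_mismatches(required, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

Calling `pin_mismatches(REQUIRED)` returns an empty dict when the environment matches the pin, and otherwise lists each offending package with the expected and found versions.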

Highlighted Details

  • Challenges the notion of "Aha moments" and emergent self-reflection in R1-Zero-like training, finding such behaviors present in base models.
  • Identifies "Superficial Self-Reflection" (SSR) where self-reflection does not guarantee correct outputs.
  • Attributes increased response length during RL optimization to reward function design, not emergent self-reflection.
  • Provides scripts for reproducing results on the Countdown task (training/run_grpo.sh) and references instructions for MATH task reproduction (simpleRL/train).
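One way to probe the SSR finding on a set of sampled responses is to cross-tabulate whether a response contains reflection-like phrasing against whether its final answer is correct. The sketch below is an illustration, not the project's analysis code: the keyword list, the category names, and the rule "reflective but incorrect counts as superficial" are all assumptions, and keyword matching is only a rough proxy for self-reflection.

```python
import re
from collections import Counter

# Illustrative keyword list; an assumption, not taken from the project.
REFLECTION_KEYWORDS = ("wait", "recheck", "rethink", "try again", "let me check")

_PATTERN = re.compile(
    "|".join(rf"\b{re.escape(k)}\b" for k in REFLECTION_KEYWORDS),
    re.IGNORECASE,
)


def has_reflection(response):
    """True if the response contains any reflection-like keyword."""
    return _PATTERN.search(response) is not None


def categorize(response, is_correct):
    """Place one response into a reflection-by-correctness category."""
    if has_reflection(response):
        return "reflection_correct" if is_correct else "superficial_self_reflection"
    return "no_reflection_correct" if is_correct else "no_reflection_incorrect"


def summarize(samples):
    """Count categories over an iterable of (response, is_correct) pairs."""
    return Counter(categorize(response, ok) for response, ok in samples)
```

Running `summarize` over base-model samples and over RL-trained samples, with correctness supplied by a task checker, would give comparable counts for the two models under the same keyword heuristic.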

Maintenance & Community

The README lists no community channels (e.g., Discord, Slack), gives no details on ongoing maintenance, and names no contributors beyond the authors of the associated blog post.

Licensing & Compatibility

The project is distributed under the MIT license, which is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

This work is presented as a "Pilot Study," focusing on analysis and reproduction of specific training phenomena rather than offering a production-ready system. The findings suggest that previous interpretations of R1-Zero-like training outcomes may require revision, particularly regarding the emergence of self-reflection.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
