R1-Searcher by RUCAIBox

RL framework for incentivizing LLM search via outcome supervision

Created 6 months ago
633 stars

Top 52.3% on SourcePulse

Project Summary

R1-Searcher enables Large Reasoning Models (LRMs) to effectively invoke and use web search for knowledge-intensive tasks. It targets researchers and developers aiming to enhance LLM reasoning, particularly on multi-hop and time-sensitive questions, by providing a reinforcement learning framework that requires no instruction fine-tuning.

How It Works

The project employs a two-stage outcome-supervised reinforcement learning approach. Stage 1 trains the model to invoke search using only format rewards. Stage 2 further refines this by teaching the model to effectively use retrieved information, incorporating both format and answer rewards. This method leverages Reinforce++ as the RL algorithm and relies on carefully designed rewards to guide the learning process, avoiding complex prompt engineering or process supervision.
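The two-stage reward scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual reward code: the tag names (`<begin_of_search>`, `<answer>`), the reward weights, and the use of exact match for the answer reward are all assumptions for the sake of the example.

```python
import re

# Hypothetical format tokens -- the real tags and reward weights are defined
# in the R1-Searcher training code, not reproduced here.
SEARCH_PATTERN = re.compile(r"<begin_of_search>.*?<end_of_search>", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(rollout: str) -> float:
    """Stage 1: reward the model only for emitting a well-formed
    search invocation and a final answer block."""
    has_search = bool(SEARCH_PATTERN.search(rollout))
    has_answer = bool(ANSWER_PATTERN.search(rollout))
    return 0.5 * has_search + 0.5 * has_answer


def answer_reward(rollout: str, gold: str) -> float:
    """Outcome reward: exact match against the gold answer
    (the actual project may use a softer metric such as F1)."""
    m = ANSWER_PATTERN.search(rollout)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == gold.strip().lower() else 0.0


def stage2_reward(rollout: str, gold: str) -> float:
    # Stage 2 combines the format reward with the answer reward,
    # teaching the model to actually use retrieved information.
    return format_reward(rollout) + answer_reward(rollout, gold)
```

In a training loop, these scalar rewards would be fed to the RL algorithm (Reinforce++ here) as the outcome signal for each rollout, with no per-step process supervision.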

Quick Start & Requirements

  • Install: Create a conda environment (conda create --name r1-searcher python=3.10.16), activate it, and install dependencies: pip install vllm==0.6.5 packaging ninja flash-attn deepspeed accelerate datasets.
  • Prerequisites: Python 3.10.16, CUDA-enabled GPU (for flash-attn, vllm, and embedding/indexing).
  • Data Prep: Requires downloading and indexing Wikipedia corpus (KILT dataset).
  • Resources: Training involves multiple servers for Ray, reward servers, and model rollouts. Evaluation requires local search setup and potentially online search access.
  • Links: Arxiv, Model Checkpoints, Training Data.
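The install steps from the bullets above, collected as a shell snippet (package pins as given in the README):

```shell
# Create and activate a fresh conda environment (Python 3.10.16)
conda create --name r1-searcher python=3.10.16
conda activate r1-searcher

# Install dependencies; vllm is pinned to 0.6.5 per the README,
# and flash-attn requires a CUDA-enabled GPU and toolchain to build
pip install vllm==0.6.5 packaging ninja flash-attn deepspeed accelerate datasets
```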

Highlighted Details

  • Achieves significant performance improvements on benchmarks like HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle, outperforming existing methods and even closed-source models like GPT-4o-mini.
  • Demonstrates strong generalization capabilities to out-of-domain datasets and online search scenarios.
  • LongCoT reasoning after RL is presented as a more efficient scaling method than tree-search approaches.
  • Compatible with both Base LLMs and Chat LLMs, and can train from scratch on Base LLMs.

Maintenance & Community

The project is associated with RUCAIBox, a research group. A contact email is provided for questions; no community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that the underlying Base LLM's capability largely determines whether training can start directly from zero. Online-search performance is evaluated only on Bamboogle; broader online-search integration is not extensively detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 15 stars in the last 30 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 11 more.

TinyZero by Jiayi-Pan

Minimal reproduction of DeepSeek R1 Zero for countdown/multiplication tasks

Top 0.2% on SourcePulse
12k stars
Created 8 months ago
Updated 4 months ago