R1-Searcher by RUCAIBox

RL framework for incentivizing LLM search via outcome supervision

Created 6 months ago
633 stars

Top 52.3% on SourcePulse

Project Summary

R1-Searcher enables Large Reasoning Models (LRMs) to effectively invoke and use web search for knowledge-intensive tasks. It targets researchers and developers aiming to enhance LLM reasoning, particularly on multi-hop and time-sensitive questions, by providing a reinforcement learning framework that requires no instruction fine-tuning.

How It Works

The project employs a two-stage outcome-supervised reinforcement learning approach. Stage 1 trains the model to invoke search using only format rewards. Stage 2 further refines this by teaching the model to effectively use retrieved information, incorporating both format and answer rewards. This method leverages Reinforce++ as the RL algorithm and relies on carefully designed rewards to guide the learning process, avoiding complex prompt engineering or process supervision.
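The two-stage reward scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual reward code: the tag names (`<begin_of_search>`, `<answer>`), the reward weights, and the use of exact match for the answer reward are all assumptions for the sake of the example.

```python
import re

# Hypothetical format tokens -- the real tags and reward weights are defined
# in the R1-Searcher training code, not reproduced here.
SEARCH_PATTERN = re.compile(r"<begin_of_search>.*?<end_of_search>", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(rollout: str) -> float:
    """Stage 1: reward the model only for emitting a well-formed
    search invocation and a final answer block."""
    has_search = bool(SEARCH_PATTERN.search(rollout))
    has_answer = bool(ANSWER_PATTERN.search(rollout))
    return 0.5 * has_search + 0.5 * has_answer


def answer_reward(rollout: str, gold: str) -> float:
    """Outcome reward: exact match against the gold answer
    (the actual project may use a softer metric such as F1)."""
    m = ANSWER_PATTERN.search(rollout)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == gold.strip().lower() else 0.0


def stage2_reward(rollout: str, gold: str) -> float:
    # Stage 2 combines the format reward with the answer reward,
    # teaching the model to actually use retrieved information.
    return format_reward(rollout) + answer_reward(rollout, gold)
```

In a training loop, these scalar rewards would be fed to the RL algorithm (Reinforce++ here) as the outcome signal for each rollout, with no per-step process supervision.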

Quick Start & Requirements

  • Install: Create a conda environment (conda create --name r1-searcher python=3.10.16), activate it, and install dependencies: pip install vllm==0.6.5 packaging ninja flash-attn deepspeed accelerate datasets.
  • Prerequisites: Python 3.10.16, CUDA-enabled GPU (for flash-attn, vllm, and embedding/indexing).
  • Data Prep: Requires downloading and indexing Wikipedia corpus (KILT dataset).
  • Resources: Training involves multiple servers for Ray, reward servers, and model rollouts. Evaluation requires local search setup and potentially online search access.
  • Links: Arxiv, Model Checkpoints, Training Data.
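The install steps from the bullets above, collected as a shell snippet (package pins as given in the README):

```shell
# Create and activate a fresh conda environment (Python 3.10.16)
conda create --name r1-searcher python=3.10.16
conda activate r1-searcher

# Install dependencies; vllm is pinned to 0.6.5 per the README,
# and flash-attn requires a CUDA-enabled GPU and toolchain to build
pip install vllm==0.6.5 packaging ninja flash-attn deepspeed accelerate datasets
```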

Highlighted Details

  • Achieves significant performance improvements on benchmarks like HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle, outperforming existing methods and even closed-source models like GPT-4o-mini.
  • Demonstrates strong generalization capabilities to out-of-domain datasets and online search scenarios.
  • LongCoT reasoning after RL is presented as a more efficient scaling method than tree-search approaches.
  • Compatible with both Base LLMs and Chat LLMs, and can train from scratch on Base LLMs.

Maintenance & Community

The project is associated with RUCAIBox, a research group. A contact email is provided for questions; no community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that the underlying Base LLM's capability largely determines whether training can start directly from zero. Online-search performance is evaluated only on Bamboogle; broader online-search integration is not extensively detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 15 stars in the last 30 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 11 more.

TinyZero by Jiayi-Pan

Minimal reproduction of DeepSeek R1 Zero for countdown/multiplication tasks

Top 0.2% on SourcePulse
12k stars
Created 8 months ago
Updated 4 months ago