RL framework for incentivizing LLM search via outcome supervision
R1-Searcher enables Large Reasoning Models (LRMs) to invoke and use web search effectively for knowledge-intensive tasks. It targets researchers and developers aiming to enhance LLM reasoning, particularly on multi-hop and time-sensitive questions, with a reinforcement learning framework that requires no instruction fine-tuning.
How It Works
The project employs a two-stage, outcome-supervised reinforcement learning approach. Stage 1 trains the model to invoke search correctly, using only format rewards. Stage 2 then teaches the model to use the retrieved information effectively, incorporating both format and answer rewards. The method uses REINFORCE++ as the RL algorithm and relies on carefully designed outcome rewards to guide learning, avoiding complex prompt engineering and process supervision.
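As a rough illustration of this two-stage reward design, the sketch below computes a per-rollout scalar reward: format-only in stage 1, format plus answer quality in stage 2. The tag names (<search>/<answer>) and the token-level F1 answer score are assumptions chosen for illustration, not the project's exact special tokens or scoring rule.

```python
import re
import string
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def format_reward(response: str) -> float:
    """Stage 1: reward only a well-formed rollout.

    The <search>/<answer> tags are illustrative assumptions; the project
    defines its own special tokens for search invocation.
    """
    has_search = re.search(r"<search>.+?</search>", response, re.DOTALL)
    has_answer = re.search(r"<answer>.+?</answer>", response, re.DOTALL)
    return 0.5 if (has_search and has_answer) else 0.0


def answer_reward(response: str, gold: str) -> float:
    """Stage 2: score the final answer with token-level F1, a common
    outcome-supervision choice for QA (assumed here, not confirmed)."""
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    pred, ref = _tokens(match.group(1)), _tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def outcome_reward(response: str, gold: str, stage: int) -> float:
    """Per-stage scalar reward: format only in stage 1, format + answer in stage 2."""
    reward = format_reward(response)
    if stage == 2:
        reward += answer_reward(response, gold)
    return reward
```

A scalar like the one returned here is what a REINFORCE++-style policy-gradient update would consume per rollout; no intermediate (process-level) supervision is needed.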
Quick Start & Requirements
Create a conda environment (conda create --name r1-searcher python=3.10.16), activate it, and install dependencies: pip install vllm==0.6.5 packaging ninja flash-attn deepspeed accelerate datasets. Key dependencies include flash-attn, vllm, and components for embedding/indexing.
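Once dependencies are installed, a trained checkpoint can be served with the pinned vLLM version. The sketch below is a minimal example; the model path is a placeholder and the stop tag is an assumed search-invocation marker, not the project's published names.

```python
from vllm import LLM, SamplingParams

# Placeholder path: substitute the released R1-Searcher checkpoint.
llm = LLM(model="path/to/r1-searcher-checkpoint")

# Stop at the (assumed) search tag so retrieved documents can be
# injected into the context before generation resumes.
params = SamplingParams(temperature=0.0, max_tokens=512, stop=["</search>"])

outputs = llm.generate(
    ["Who directed the film that won Best Picture in 1998?"], params
)
print(outputs[0].outputs[0].text)
```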
Highlighted Details
Maintenance & Community
The project is associated with RUCAIBox, a research group, and a contact email is provided for questions. No explicit community channels (Discord/Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README notes that the capability of the base LLM largely determines whether training can start directly from zero (i.e., from the base model without instruction fine-tuning). Online search performance is evaluated only on Bamboogle; broader online-search integration is not extensively detailed.