DeepRetrieval by pat-jj

RL training for LLM query generation to improve information retrieval

created 5 months ago
601 stars

Top 55.2% on sourcepulse

Project Summary

DeepRetrieval trains Large Language Models (LLMs) for query generation using reinforcement learning, enabling them to discover optimal search queries through trial and error. This approach eliminates the need for supervised query-augmentation pairs and significantly boosts information retrieval performance across various domains, making it valuable for researchers and developers seeking to enhance search capabilities.

How It Works

The system employs an LLM that generates a reasoning step within a <think> tag, followed by the final augmented query in an <answer> tag. This structured output facilitates explicit chain-of-thought reasoning. Retrieval metrics serve as rewards, guiding the LLM to iteratively refine queries for maximum retrieval effectiveness, a novel departure from traditional supervised methods.
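A minimal sketch of how such structured output could be parsed and scored (the `<think>`/`<answer>` tag names come from the README; the parsing code, reward function, and document IDs are illustrative assumptions, not the project's implementation):

```python
import re

def parse_response(text):
    """Split the model output into its chain-of-thought and the
    augmented query, following the <think>/<answer> structure."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        return None, None  # malformed output: no query to reward
    reasoning = think.group(1).strip() if think else ""
    return reasoning, answer.group(1).strip()

def recall_reward(retrieved_ids, relevant_ids):
    """Use retrieval recall directly as the RL reward signal."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

response = (
    "<think>The user wants trials on statins in elderly patients; "
    "broaden with synonyms.</think>"
    "<answer>(statins OR HMG-CoA inhibitors) AND elderly AND trial</answer>"
)
reasoning, query = parse_response(response)
reward = recall_reward(retrieved_ids=["d1", "d3"], relevant_ids=["d1", "d2", "d3"])
print(query)
print(reward)
```

During training, the query would be sent to the actual search backend and the resulting recall (or another retrieval metric) fed back as the reward for that rollout.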

Quick Start & Requirements

  • Installation: `conda create -n zero python=3.9`, then `pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121`, `pip3 install vllm==0.6.3`, `pip3 install ray`, `cd code`, `pip install -e .`, `pip3 install flash-attn --no-build-isolation`, and `pip install wandb`. Additional installs for sparse/dense retrieval: `pip install pyserini`, `pip install faiss-gpu==1.7.2`. SQL support: `pip install func_timeout`.
  • Prerequisites: Python 3.9+, PyTorch with CUDA 12.1, vLLM, Ray, FlashAttention 2, Wandb. Optional: Pyserini, FAISS-GPU, Java 11.
  • Data: Pre-processed datasets available on Huggingface (DeepRetrieval/datasets) or process raw data.
  • API Keys: Required for search engine integration (e.g., PubMed API key).
  • Resources: Training logs suggest potential VRAM limitations; the project recommends setting `critic.model.enable_gradient_checkpointing=True`.
  • Links: Huggingface Datasets, PubMed API Instructions (NCBI), arXiv Paper.
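The installation steps above can be collected into a single environment-setup script (commands and pinned versions as listed in the README; treat this as a setup sketch for a CUDA 12.1 Linux machine, not a tested installer):

```shell
# Create and activate the training environment (Python 3.9 per the README)
conda create -n zero python=3.9 -y
conda activate zero

# Core training stack: PyTorch (CUDA 12.1 wheels), vLLM, Ray
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3
pip3 install ray

# Install the project itself, plus FlashAttention 2 and Weights & Biases
cd code
pip install -e .
pip3 install flash-attn --no-build-isolation
pip install wandb

# Optional extras: sparse/dense retrieval and SQL database search
pip install pyserini faiss-gpu==1.7.2
pip install func_timeout
```

Pyserini additionally requires Java 11, as noted in the prerequisites above.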

Highlighted Details

  • Achieves 65.07% recall on publication search and 63.18% on clinical trials search, significantly outperforming prior SOTA (24.68% and 32.11% respectively).
  • Demonstrates strong performance with a 3B parameter model, surpassing larger models like GPT-4o and Claude-3.5-Sonnet.
  • Versatile across literature search, evidence-seeking, classic IR, and SQL database search.
  • Eliminates the need for supervised query-augmentation pairs.

Maintenance & Community

The project builds primarily on verl and Pyserini. The base model used in experiments is Qwen2.5-3B-Instruct. Star the repository to stay updated.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a 2025 arXiv preprint, indicating it may be experimental or pre-release. Specific VRAM requirements for training might necessitate gradient checkpointing.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
Star History
127 stars in the last 90 days

Explore Similar Projects

Starred by Jason Liu (author of Instructor) and Ross Taylor (cofounder of General Reasoning; creator of Papers with Code).

Search-R1 by PeterGriffinJin

Top 1.3% on sourcepulse · 3k stars
RL framework for training LLMs to use search engines
created 5 months ago · updated 3 weeks ago