DeepRetrieval by pat-jj

RL training for LLM query generation to improve information retrieval

created 5 months ago
601 stars

Top 55.2% on sourcepulse

Project Summary

DeepRetrieval trains Large Language Models (LLMs) for query generation using reinforcement learning, enabling them to discover optimal search queries through trial and error. This approach eliminates the need for supervised query-augmentation pairs and significantly boosts information retrieval performance across various domains, making it valuable for researchers and developers seeking to enhance search capabilities.

How It Works

The system employs an LLM that generates a reasoning step within a <think> tag, followed by the final augmented query in an <answer> tag. This structured output facilitates explicit chain-of-thought reasoning. Retrieval metrics serve as rewards, guiding the LLM to iteratively refine queries for maximum retrieval effectiveness, a novel departure from traditional supervised methods.
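A minimal sketch of how such structured output could be parsed and scored (the `<think>`/`<answer>` tag names come from the README; the parsing code, reward function, and document IDs are illustrative assumptions, not the project's implementation):

```python
import re

def parse_response(text):
    """Split the model output into its chain-of-thought and the
    augmented query, following the <think>/<answer> structure."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        return None, None  # malformed output: no query to reward
    reasoning = think.group(1).strip() if think else ""
    return reasoning, answer.group(1).strip()

def recall_reward(retrieved_ids, relevant_ids):
    """Use retrieval recall directly as the RL reward signal."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

response = (
    "<think>The user wants trials on statins in elderly patients; "
    "broaden with synonyms.</think>"
    "<answer>(statins OR HMG-CoA inhibitors) AND elderly AND trial</answer>"
)
reasoning, query = parse_response(response)
reward = recall_reward(retrieved_ids=["d1", "d3"], relevant_ids=["d1", "d2", "d3"])
print(query)
print(reward)
```

During training, the query would be sent to the actual search backend and the resulting recall (or another retrieval metric) fed back as the reward for that rollout.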

Quick Start & Requirements

  • Installation: `conda create -n zero python=3.9`, then `pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121`, `pip3 install vllm==0.6.3`, `pip3 install ray`, `cd code`, `pip install -e .`, `pip3 install flash-attn --no-build-isolation`, and `pip install wandb`. Additional installs for sparse/dense retrieval: `pip install pyserini`, `pip install faiss-gpu==1.7.2`. SQL support: `pip install func_timeout`.
  • Prerequisites: Python 3.9+, PyTorch with CUDA 12.1, vLLM, Ray, FlashAttention 2, Wandb. Optional: Pyserini, FAISS-GPU, Java 11.
  • Data: Pre-processed datasets available on Huggingface (DeepRetrieval/datasets) or process raw data.
  • API Keys: Required for search engine integration (e.g., PubMed API key).
  • Resources: Training logs suggest potential VRAM limitations; the project recommends setting `critic.model.enable_gradient_checkpointing=True`.
  • Links: Huggingface Datasets, PubMed API Instructions (NCBI), arXiv Paper.
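The installation steps above can be collected into a single environment-setup script (commands and pinned versions as listed in the README; treat this as a setup sketch for a CUDA 12.1 Linux machine, not a tested installer):

```shell
# Create and activate the training environment (Python 3.9 per the README)
conda create -n zero python=3.9 -y
conda activate zero

# Core training stack: PyTorch (CUDA 12.1 wheels), vLLM, Ray
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3
pip3 install ray

# Install the project itself, plus FlashAttention 2 and Weights & Biases
cd code
pip install -e .
pip3 install flash-attn --no-build-isolation
pip install wandb

# Optional extras: sparse/dense retrieval and SQL database search
pip install pyserini faiss-gpu==1.7.2
pip install func_timeout
```

Pyserini additionally requires Java 11, as noted in the prerequisites above.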

Highlighted Details

  • Achieves 65.07% recall on publication search and 63.18% on clinical trials search, significantly outperforming prior SOTA (24.68% and 32.11% respectively).
  • Demonstrates strong performance with a 3B parameter model, surpassing larger models like GPT-4o and Claude-3.5-Sonnet.
  • Versatile across literature search, evidence-seeking, classic IR, and SQL database search.
  • Eliminates the need for supervised query-augmentation pairs.

Maintenance & Community

The project builds primarily on verl and Pyserini. The base model used in experiments is Qwen2.5-3B-Instruct. Star the repository to stay updated.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a 2025 arXiv preprint, indicating it may be experimental or pre-release. Specific VRAM requirements for training might necessitate gradient checkpointing.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
Star History
127 stars in the last 90 days

Explore Similar Projects

Starred by Jason Liu (author of Instructor) and Ross Taylor (cofounder of General Reasoning; creator of Papers with Code).

Search-R1 by PeterGriffinJin

Top 1.3% on sourcepulse · 3k stars
RL framework for training LLMs to use search engines
created 5 months ago · updated 3 weeks ago