Diver by AQ-MedAI

Advanced RAG for complex reasoning

Created 10 months ago

258 stars

Top 98.0% on SourcePulse

Project Summary

DIVER is an open-source retrieval pipeline that addresses the limitations of standard RAG on complex, multi-step reasoning queries. It retrieves information through a multi-stage pipeline and reports state-of-the-art performance on reasoning-intensive benchmarks such as BRIGHT. It is aimed at engineers building question-answering and knowledge-retrieval systems that require deep reasoning.

How It Works

DIVER uses a four-stage architecture aimed at reasoning-intensive tasks: document pre-processing; iterative, LLM-driven query expansion that refines the query; a retriever fine-tuned on synthetic reasoning data so that it captures complex relationships between queries and documents; and a merged reranker that fuses traditional retrieval scores with LLM "helpfulness" ratings to produce the final ranking.
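As a rough illustration of two of these stages, the sketch below shows an iterative query-expansion loop and a score-fusion reranker. It is a minimal sketch under stated assumptions, not DIVER's actual implementation: the function names, the `retrieve` and `rewrite` callables, the min-max normalization, and the `alpha` weight are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, List


def expand_query(
    query: str,
    retrieve: Callable[[str], List[str]],
    rewrite: Callable[[str, str, List[str]], str],
    rounds: int = 2,
) -> str:
    """Iteratively refine a query using documents retrieved for the current version.

    `retrieve` and `rewrite` are hypothetical callables: `retrieve` returns
    documents for a query, and `rewrite` stands in for an LLM call that takes
    the original query, the current expansion, and the retrieved documents.
    """
    expanded = query
    for _ in range(rounds):
        docs = retrieve(expanded)
        expanded = rewrite(query, expanded, docs)
    return expanded


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    retrieval: Dict[str, float],
    helpfulness: Dict[str, float],
    alpha: float = 0.5,
) -> List[str]:
    """Rank documents by a weighted mix of retrieval and LLM helpfulness scores.

    Assumption: linear interpolation of min-max normalized scores. The fusion
    rule DIVER actually uses may differ.
    """
    r = min_max(retrieval)
    h = min_max(helpfulness)
    fused = {d: alpha * r[d] + (1 - alpha) * h.get(d, 0.0) for d in retrieval}
    return sorted(fused, key=fused.get, reverse=True)


# Toy example: document "b" has a middling retrieval score but the highest
# helpfulness rating, so it moves to the top after fusion.
ranking = fuse_scores(
    {"a": 0.9, "b": 0.5, "c": 0.1},
    {"a": 1.0, "b": 5.0, "c": 3.0},
)
print(ranking)  # ['b', 'a', 'c']
```

Normalizing both signals before mixing keeps either one from dominating purely because of its scale; setting `alpha` to 1.0 falls back to the retrieval-only ranking.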

Quick Start & Requirements

To reproduce the results, download the BRIGHT dataset and the models, then run the scripts for query expansion, retrieval, and reranking in turn, or use run_all.sh instead. Key dependencies are Python, transformers, and sentence-transformers; GPU acceleration is essential. Fine-tuning uses ms-swift. Official models are hosted on Hugging Face and ModelScope, and the README links to the arXiv paper and the dataset.

Highlighted Details

DIVER V2 achieves a state-of-the-art NDCG@10 of 45.8 on BRIGHT, and GroupRank-32B reaches 46.8. The retriever models range from 0.6B to 4B and trade size against performance; the 4B-1020 retriever scores 31.9. The core innovations are the LLM-driven query expansion, the reasoning-enhanced retriever, and the merged reranker that uses LLM helpfulness alongside traditional retrieval scores.
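For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 ranked documents with that of an ideal ordering. The sketch below is a generic implementation with linear gains, not the evaluation code used for BRIGHT; the function names are illustrative. It returns a value between 0 and 1, whereas the figures quoted above appear to be on a 0-100 scale.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: rank i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query; documents missing from `relevance` count as 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two relevant documents, one of them ranked third instead of second.
score = ndcg_at_k(["d1", "d2", "d3"], {"d1": 1.0, "d3": 1.0})
print(round(score, 4))  # 0.9197
```

A benchmark-level score is typically the mean of this per-query value over all queries.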

Maintenance & Community

A "TODO List" of planned releases, such as DIVER-Reranker, indicates active development. One retriever model has more than 2.6k monthly downloads on Hugging Face, which suggests community interest. No dedicated community channels (e.g., Discord) are listed.

Licensing & Compatibility

The README does not state a software license. This needs to be clarified before adoption, particularly for commercial use or integration into proprietary systems.

Limitations & Caveats

The project is under active development. The missing license information is a significant blocker to adoption. Reproduction requires the BRIGHT dataset and, for the larger models, potentially substantial GPU resources.
