Diver by AQ-MedAI

Advanced RAG for complex reasoning

Created 6 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Summary
DIVER is an open-source retrieval pipeline that addresses the limitations of standard RAG on complex, multi-step reasoning queries. It retrieves in multiple stages and reports state-of-the-art performance on reasoning-intensive benchmarks such as BRIGHT. It is aimed at engineers building question-answering and knowledge retrieval systems that require deep reasoning.

How It Works
DIVER employs a four-stage architecture aimed at reasoning-intensive tasks:
1. Document pre-processing.
2. Iterative LLM-driven query expansion, which refines the query over successive rounds.
3. A specialized retriever fine-tuned on synthetic reasoning data so that it captures complex relationships between queries and documents.
4. A merged reranker that fuses traditional retrieval scores with LLM "helpfulness" ratings to produce the final ranking.
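The query expansion and merged reranking stages can be illustrated with a minimal Python sketch. It is not DIVER's implementation: the function names, the `retrieve` and `rewrite` callables, the min-max normalization, and the `alpha` weight are all assumptions made for illustration, since the summary does not state how the scores are actually fused.

```python
from typing import Callable, Dict, List


def expand_query(
    query: str,
    retrieve: Callable[[str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> str:
    """Iteratively expand a query (hypothetical helper).

    Each round retrieves documents for the current query, then asks an
    LLM-backed `rewrite` callable to refine the query using those documents.
    """
    expanded = query
    for _ in range(rounds):
        docs = retrieve(expanded)
        expanded = rewrite(expanded, docs)
    return expanded


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merged_rerank(
    retrieval_scores: Dict[str, float],
    helpfulness: Dict[str, float],
    alpha: float = 0.5,
) -> List[str]:
    """Rank documents by a weighted sum of normalized scores.

    `alpha` weights the retriever score and `1 - alpha` the LLM helpfulness
    rating. Both dicts must share the same document ids. The weighted-sum
    form is an assumption, not the documented DIVER formula.
    """
    r = min_max(retrieval_scores)
    h = min_max(helpfulness)
    fused = {doc: alpha * r[doc] + (1 - alpha) * h[doc] for doc in retrieval_scores}
    return sorted(fused, key=fused.get, reverse=True)
```

With `alpha=1.0` the ranking falls back to the retriever's order, and with `alpha=0.0` it follows the LLM ratings alone, which makes the weight a convenient knob for checking how much each signal contributes.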

Quick Start & Requirements
Reproducing the results involves downloading the BRIGHT dataset and the models, then running the scripts for query expansion, retrieval, and reranking in turn, or using run_all.sh instead. Key dependencies are Python, transformers, and sentence-transformers; GPU acceleration is essential. Fine-tuning uses ms-swift. Official models are hosted on Hugging Face and ModelScope. Links to the arXiv paper and the dataset are provided.

Highlighted Details
DIVER V2 reaches an NDCG@10 of 45.8 on BRIGHT, reported as state of the art, and GroupRank-32B reaches 46.8. Retriever models (0.6B to 4B) trade size against performance; the 4B-1020 model scores 31.9. The core innovations are the LLM-driven query expansion, the reasoning-enhanced retriever, and the merged reranker.
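For readers unfamiliar with the metric, NDCG@10 can be computed as in the sketch below. This uses the common linear-gain formulation and is not taken from the DIVER or BRIGHT evaluation code; the function names are hypothetical. The figures quoted above are presumably this value multiplied by 100.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query.

    `ranked_ids` is the system's ranking; `relevance` maps document ids to
    graded relevance labels, with unlabeled documents treated as 0.
    """
    gains = [relevance.get(doc, 0.0) for doc in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    # A query with no relevant documents contributes 0 by convention here.
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A benchmark score is then the mean of `ndcg_at_k` over all queries in the evaluation set.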

Maintenance & Community
A "TODO List" of planned releases, such as DIVER-Reranker, indicates ongoing development. One retriever model has over 2.6k monthly downloads on Hugging Face, which suggests community interest. No community channels (e.g., Discord) are listed.

Licensing & Compatibility
The README does not state a software license. This needs to be clarified before adoption, particularly for commercial use or integration into proprietary systems.

Limitations & Caveats
The project is under active development. The missing license information is a significant blocker to adoption. Reproduction requires the BRIGHT dataset and, for the larger models, potentially substantial GPU resources.

Health Check
Last Commit: 1 month ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 4 stars in the last 30 days
