LongRAG by TIGER-AI-Lab

RAG framework enhanced with long-context LLMs

Created 2 years ago

250 stars

Project Summary

LongRAG addresses the performance limitations of traditional Retrieval-Augmented Generation (RAG) frameworks by rebalancing the workload between the retriever and the reader. It uses much longer retrieval units (about 4K tokens each) and supports long-context Large Language Models (LLMs) as readers. The framework is aimed at researchers and practitioners who want to improve RAG systems: longer units keep related information together, which may yield higher accuracy on complex question-answering tasks.

How It Works

LongRAG replaces the conventional RAG design, in which short retrieval units leave the retriever to search a very large number of candidates, with a "long retriever" paired with a "long reader". Its retrieval units are about 4K tokens each, roughly 30 times longer than usual, so each unit carries richer context. The reader LLM then works from more complete information, which reduces ambiguity and shifts effort from retrieval to reading. The approach builds on an established dense retrieval toolkit (Tevatron) and long-context LLMs.
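To make the idea of a long retrieval unit concrete, here is a minimal sketch that packs short passages into units of at most about 4K tokens. It is not taken from the repository: the function names, the greedy packing strategy, and the whitespace token count are all assumptions chosen for illustration, and the project's own corpus preparation scripts may group documents differently.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def pack_into_long_units(passages: List[str], max_tokens: int = 4000) -> List[str]:
    """Greedily merge consecutive passages into units of at most max_tokens.

    A single passage longer than max_tokens is kept whole as its own unit.
    """
    units: List[str] = []
    current: List[str] = []
    current_len = 0
    for passage in passages:
        n = count_tokens(passage)
        # Close the current unit if adding this passage would exceed the budget.
        if current and current_len + n > max_tokens:
            units.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(passage)
        current_len += n
    if current:
        units.append("\n\n".join(current))
    return units
```

With passages of roughly 130 tokens each, a 4K-token budget holds about 30 of them per unit, which matches the 30x figure quoted above.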

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/TIGER-AI-Lab/LongRAG.git), navigate into the directory (cd LongRAG), and install dependencies (pip install -r requirements.txt).
Prerequisites: Corpus preparation uses Python scripts and potentially large datasets (e.g., Wikipedia dumps). The retrieval encoding script (scripts/run_retrieve_tevatron.sh) is shown running on 4 GPUs, so multiple GPUs are likely needed. Evaluating the reader (scripts/run_eval_qa.sh) requires API keys and configuration for the supported LLMs (GPT-4o, GPT-4-Turbo, Gemini-1.5-Pro, Claude-3-Opus).
Resources: Corpus preparation and retrieval encoding can be resource-intensive.
Links:
- Processed Corpus (NQ, HotpotQA): https://huggingface.co/TIGER-Lab/LongRAG
- DPR titles: Link provided in README.
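Because reader evaluation depends on API keys, a small preflight check can catch a missing key before a long run starts. The sketch below is hypothetical: the mapping from reader model to environment variable name is an assumption based on common SDK conventions, and the repository may expect keys to be configured in a different way.

```python
import os
from typing import Dict, List, Mapping

# Assumption: these variable names follow common SDK conventions;
# they are not confirmed by the LongRAG repository.
READER_KEY_VARS: Dict[str, str] = {
    "GPT-4o": "OPENAI_API_KEY",
    "GPT-4-Turbo": "OPENAI_API_KEY",
    "Gemini-1.5-Pro": "GOOGLE_API_KEY",
    "Claude-3-Opus": "ANTHROPIC_API_KEY",
}


def missing_keys(readers: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the environment variables that are unset or empty for the given readers."""
    missing: List[str] = []
    for reader in readers:
        var = READER_KEY_VARS[reader]
        if not env.get(var) and var not in missing:
            missing.append(var)
    return missing
```

Calling `missing_keys(["GPT-4o", "Claude-3-Opus"])` before launching an evaluation returns an empty list when both keys are present in the environment.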

Highlighted Details

Employs 4K-token retrieval units, a 30x increase over typical RAG designs.
Integrates with the Tevatron toolkit for dense retrieval.
Supports advanced long-context LLMs such as Gemini-1.5-Pro and GPT-4o as readers.
Reports top-1 retrieval accuracy of 88% and exact match rate of 64% on sample evaluations.
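For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them for a single question. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and counts a retrieval as correct when a gold answer string appears in one of the top-k units. The function names are illustrative, and the repository's evaluation script may define the metrics differently.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def answer_in_top_k(retrieved_units: List[str], gold_answers: List[str], k: int = 1) -> bool:
    """True if any gold answer occurs in one of the first k retrieved units."""
    golds = [g for g in (normalize_answer(a) for a in gold_answers) if g]
    top = [normalize_answer(unit) for unit in retrieved_units[:k]]
    return any(gold in unit for unit in top for gold in golds)
```

Averaging `answer_in_top_k(..., k=1)` and `exact_match(...)` over all test questions gives dataset-level figures comparable in kind to the percentages above.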

Maintenance & Community

The project is associated with authors Ziyan Jiang, Xueguang Ma, and Wenhu Chen. The repository is described as still being polished. No community channels (e.g., Discord, Slack) or roadmap links are provided.

Licensing & Compatibility

NQ Dataset: Apache License 2.0 (permissive).
HotpotQA Dataset: CC BY-SA 4.0 License (copyleft, requires derivative works to be shared under the same license).
Compatibility: The copyleft nature of the CC BY-SA 4.0 license may impose restrictions on integrating this work into closed-source or proprietary systems.

Limitations & Caveats

The repository is still being polished, so further changes are likely. Support for additional LLMs is planned but not yet implemented. Users must configure API keys and settings for each reader model, and corpus preparation can be complex and resource-intensive.
