LongRAG by TIGER-AI-Lab

RAG framework enhanced with long-context LLMs

Created 2 years ago
250 stars

Project Summary

LongRAG addresses performance limitations of traditional Retrieval-Augmented Generation (RAG) frameworks by rebalancing the workload between the retriever and the reader. It uses much longer retrieval units (about 4K tokens each) and long-context Large Language Models (LLMs) as readers. The framework targets researchers and practitioners who want to improve RAG systems: longer units keep more related information together, which may yield higher accuracy on complex question-answering tasks.

How It Works

Conventional RAG designs often rely on short retrieval units, which forces the retriever to search across a very large number of candidates. LongRAG replaces this with a "long retriever" and "long reader" architecture. Its retrieval units are roughly 30 times longer (about 4K tokens), so each unit carries richer context and the reader LLM sees more complete information. This is intended to reduce ambiguity and improve the efficiency and effectiveness of the RAG pipeline. The approach builds on established dense retrieval toolkits and state-of-the-art long-context LLMs.
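The idea of a long retrieval unit can be illustrated with a small sketch. It assumes that short documents are merged with related documents until a token budget of about 4K is reached. The function names, the `related` mapping, and the whitespace-based token count are hypothetical stand-ins for illustration, not the repository's actual corpus-preparation code.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    # Assumption: a whitespace split stands in for a real tokenizer.
    return len(text.split())


def group_into_long_units(
    docs: Dict[str, str],
    related: Dict[str, List[str]],
    max_tokens: int = 4000,
) -> List[List[str]]:
    """Greedily merge each document with its related documents.

    A related document is added only if the unit stays within max_tokens.
    Each document ends up in exactly one unit.
    """
    assigned = set()
    units: List[List[str]] = []
    for doc_id in docs:
        if doc_id in assigned:
            continue
        unit = [doc_id]
        assigned.add(doc_id)
        size = approx_tokens(docs[doc_id])
        for neighbor in related.get(doc_id, []):
            if neighbor in assigned or neighbor not in docs:
                continue
            neighbor_size = approx_tokens(docs[neighbor])
            if size + neighbor_size > max_tokens:
                continue  # skip neighbors that would exceed the budget
            unit.append(neighbor)
            assigned.add(neighbor)
            size += neighbor_size
        units.append(unit)
    return units


if __name__ == "__main__":
    docs = {"a": "w " * 1000, "b": "w " * 2000, "c": "w " * 1500, "d": "w " * 10}
    related = {"a": ["b", "c", "d"], "c": ["a"]}
    print(group_into_long_units(docs, related))
```

With fewer, larger units, the retriever ranks far fewer candidates, and the work of locating the precise answer shifts to the long-context reader.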

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/TIGER-AI-Lab/LongRAG.git), navigate into the directory (cd LongRAG), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Corpus preparation involves Python scripts and potentially large datasets (e.g., Wikipedia dumps). Retrieval encoding (scripts/run_retrieve_tevatron.sh) suggests a need for multiple GPUs (example uses 4). Evaluating the reader (scripts/run_eval_qa.sh) requires API keys and configurations for supported LLMs (GPT-4o, GPT-4-Turbo, Gemini-1.5-Pro, Claude-3-Opus).
  • Resources: Corpus preparation and retrieval encoding can be resource-intensive.

Highlighted Details

  • Employs 4K-token retrieval units, a 30x increase over typical RAG designs.
  • Integrates with the Tevatron toolkit for dense retrieval.
  • Supports advanced long-context LLMs such as Gemini-1.5-Pro and GPT-4o as readers.
  • Reports top-1 retrieval accuracy of 88% and exact match rate of 64% on sample evaluations.
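For readers unfamiliar with the two metrics in the last item, the sketch below shows how they are commonly computed in open-domain question answering. It assumes the widely used answer normalization (lowercasing, stripping punctuation and English articles) and treats a top-1 retrieval as a hit when the highest-ranked unit contains any gold answer string. Whether the repository's evaluation scripts apply exactly these rules is an assumption, and all function names here are illustrative.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    # Lowercase, drop punctuation, drop English articles, collapse whitespace.
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    # True if the normalized prediction equals any normalized gold answer.
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def top1_hit(top_unit_text: str, gold_answers: List[str]) -> bool:
    # True if the top-ranked retrieval unit contains any gold answer string.
    unit = normalize_answer(top_unit_text)
    return any(normalize_answer(gold) in unit for gold in gold_answers)


def rate(flags: List[bool]) -> float:
    # Fraction of True values; 0.0 for an empty list.
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
    print(rate([True, False, True, True]))
```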

Maintenance & Community

The project is associated with authors Ziyan Jiang, Xueguang Ma, and Wenhu Chen. The repository is noted as still undergoing polishing. No specific community channels (e.g., Discord, Slack) or roadmap links are provided.

Licensing & Compatibility

  • NQ Dataset: Apache License 2.0 (permissive).
  • HotpotQA Dataset: CC BY-SA 4.0 License (copyleft, requires derivative works to be shared under the same license).
  • Compatibility: The share-alike (copyleft) terms of CC BY-SA 4.0 may restrict the use of HotpotQA-derived material in closed-source or proprietary systems.

Limitations & Caveats

The repository is described as still being polished, so its contents may change. Support for additional LLMs is planned but not yet implemented. Users must configure API keys and settings for the specific reader models, and corpus preparation can be complex and resource-intensive.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
