RAG-Challenge-2 by IlyaRice

RAG system for enterprise document question answering

created 4 months ago
1,745 stars

Top 25.1% on sourcepulse

Project Summary

This repository contains the winning solution for the Enterprise RAG Challenge 2 and is aimed at researchers and practitioners exploring advanced RAG techniques. It answers questions about company annual reports by combining custom PDF parsing, vector search with parent document retrieval, LLM reranking, and structured output prompting.

How It Works

The system is a multi-stage RAG pipeline. It first parses the PDFs with custom parsing built on Docling. It then runs vector search with parent document retrieval, so that a match on a small chunk returns the larger passage the chunk came from. An LLM reranks the retrieved passages by relevance to the question. Finally, the answer is generated with structured output prompting and chain-of-thought reasoning, with query routing for complex comparison questions.
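
The retrieval and reranking stages can be illustrated with a minimal sketch. It is not the repository's code: the `Chunk` class, the two-dimensional toy vectors, the `fake_llm_score` stub, and the 0.7 blending weight are all assumptions chosen to keep the example self-contained. A real pipeline would call an embedding model and an LLM instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    page_id: int         # parent page this chunk was cut from
    text: str
    vector: list         # toy embedding (assumption); normally from an embedding model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_parent_pages(query_vec, chunks, pages, top_k=2):
    """Rank chunks by similarity, then return their full parent pages, deduplicated."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    results = {}
    for chunk in ranked:
        if chunk.page_id not in results:
            # Keep the best chunk score as the page's vector score.
            results[chunk.page_id] = (pages[chunk.page_id], cosine(query_vec, chunk.vector))
        if len(results) == top_k:
            break
    return [(pid, text, score) for pid, (text, score) in results.items()]


def rerank(candidates, llm_score, llm_weight=0.7):
    """Blend the vector score with an LLM relevance score in [0, 1] (weight is an assumption)."""
    rescored = [
        (pid, text, llm_weight * llm_score(text) + (1 - llm_weight) * vec_score)
        for pid, text, vec_score in candidates
    ]
    return sorted(rescored, key=lambda r: r[2], reverse=True)


def fake_llm_score(text):
    """Stand-in for an LLM relevance judgment (hypothetical)."""
    return 1.0 if "Risk" in text else 0.2


pages = {
    1: "Revenue was 10M in 2022, up from 8M in 2021.",
    2: "The board of directors has nine members.",
    3: "Risk factors include currency exposure on revenue.",
}
chunks = [
    Chunk(1, "Revenue was 10M in 2022", [1.0, 0.0]),
    Chunk(1, "up from 8M in 2021", [0.9, 0.1]),
    Chunk(2, "board of directors", [0.0, 1.0]),
    Chunk(3, "currency exposure on revenue", [0.6, 0.8]),
]

candidates = retrieve_parent_pages([1.0, 0.0], chunks, pages, top_k=2)
reranked = rerank(candidates, fake_llm_score)
print([pid for pid, _, _ in candidates], [pid for pid, _, _ in reranked])
```

Two chunks from page 1 match the query, but the page is returned only once, which is the point of retrieving parents rather than chunks. The reranker can then reorder pages when the LLM judgment disagrees with raw vector similarity.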

Quick Start & Requirements

  • Install (run in order):
      git clone https://github.com/IlyaRice/RAG-Challenge-2.git
      cd RAG-Challenge-2
      python -m venv venv
      venv\Scripts\Activate.ps1   (Windows)
      pip install -e . -r requirements.txt
  • Prerequisites: OpenAI/Gemini API keys. A GPU is highly recommended for PDF parsing (e.g., RTX 4090).
  • Setup: Rename env to .env and add API keys.
  • Datasets: Includes a small test set (data/test_set/) and the full competition dataset (data/erc2_set/). Refer to dataset-specific READMEs for details.
  • Usage: Run pipeline stages either by uncommenting the relevant stage in src/pipeline.py or through the main.py CLI (e.g., python main.py parse-pdfs).
  • Docs: Dataset READMEs: data/test_set/README.md, data/erc2_set/README.md.
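
Because every stage depends on API keys being present, it can help to check the .env file before starting a long run. The sketch below is a hypothetical helper, not part of the repository. The key names `OPENAI_API_KEY` and `GEMINI_API_KEY` are assumed, so check them against the env template shipped with the project.

```python
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")  # assumed names, verify locally


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Inline sample instead of a real file so the example runs anywhere.
sample = 'OPENAI_API_KEY="sk-example"\n# Gemini key not filled in yet\nGEMINI_API_KEY=\n'
env = parse_env(sample)
missing = missing_keys(env)
print("Missing keys:", missing)
```

To check a real file, pass the contents of .env to `parse_env`, for example via `pathlib.Path(".env").read_text()`.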

Highlighted Details

  • Won all categories in the Enterprise RAG Challenge 2.
  • Features custom PDF parsing with Docling.
  • Implements vector search with parent document retrieval.
  • Utilizes LLM reranking for improved context relevance.
  • Employs structured output prompting with chain-of-thought reasoning.
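
Structured output prompting with chain-of-thought reasoning can be sketched as follows. This is an illustration under stated assumptions, not the repository's prompts: the field names, the prompt wording, and the stubbed model reply are all hypothetical. The idea shown is to ask the model for a JSON object whose reasoning field comes before the final answer, then validate the reply before using it.

```python
import json
from dataclasses import dataclass

FIELDS = ("step_by_step_analysis", "relevant_pages", "final_answer")  # illustrative names


@dataclass
class Answer:
    step_by_step_analysis: str  # chain-of-thought, written before the answer
    relevant_pages: list        # page numbers the answer relies on
    final_answer: object        # number, name, boolean, or "N/A"


PROMPT_TEMPLATE = """Answer the question using only the context.
Reply with a JSON object with exactly these keys, in this order:
"step_by_step_analysis", "relevant_pages", "final_answer".
Use "N/A" as final_answer if the context does not contain the answer.

Context:
{context}

Question: {question}
"""


def parse_answer(raw):
    """Parse the model reply and fail loudly if a required field is missing."""
    data = json.loads(raw)
    missing = set(FIELDS) - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Answer(**{name: data[name] for name in FIELDS})


prompt = PROMPT_TEMPLATE.format(
    context="[page 12] Total revenue for 2022 was 10.5 million USD.",
    question="What was the total revenue in 2022, in millions of USD?",
)

# Stubbed reply standing in for a real LLM call (hypothetical).
raw_reply = json.dumps({
    "step_by_step_analysis": "Page 12 states total revenue of 10.5 million USD for 2022.",
    "relevant_pages": [12],
    "final_answer": 10.5,
})

answer = parse_answer(raw_reply)
print(answer.final_answer, answer.relevant_pages)
```

Placing the analysis field first means the model writes its reasoning before committing to an answer, while the fixed schema keeps the final value machine-readable.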

Maintenance & Community

The project is presented as competition code with "rough edges and weird workarounds." It lacks tests and has minimal error handling, indicating it is not production-ready. No specific community channels or roadmap are mentioned.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for commercial use and closed-source linking due to the permissive MIT license.

Limitations & Caveats

The author describes the code as "scrappy": it is not production-ready and has rough edges, workarounds, no tests, and minimal error handling. The IBM Watson integration does not work, because it was specific to the competition. Users must provide their own API keys.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 1,399 stars in the last 90 days
