RAG system for enterprise document question answering
Top 25.1% on sourcepulse
This repository offers the winning solution for the Enterprise RAG Challenge 2, designed for researchers and practitioners exploring advanced RAG techniques. It provides a robust system for question answering on company annual reports, achieving state-of-the-art results through a combination of custom PDF parsing, vector search with parent document retrieval, LLM reranking, and structured output prompting.
How It Works
The system employs a multi-stage RAG pipeline. It begins with custom PDF parsing using Docling, followed by vector search enhanced with parent document retrieval to improve context relevance. A crucial step involves LLM reranking to further refine the retrieved context. Finally, it utilizes structured output prompting with chain-of-thought reasoning and query routing for complex comparisons, aiming for accurate and contextually rich answers.
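The parent document retrieval step above can be sketched as follows: small chunks are indexed for search, but each keeps a pointer to its larger parent section, which is what gets returned as context. This is a minimal illustration with a toy bag-of-words overlap standing in for vector similarity; the function and variable names are hypothetical, not taken from this repository.

```python
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Toy bag-of-words overlap score, standing in for vector similarity."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    return float(sum((wa & wb).values()))

def retrieve_parent(query: str, chunks: list) -> str:
    """Search over small chunks, but return the full parent section of the best hit."""
    best = max(chunks, key=lambda c: similarity(query, c["text"]))
    return best["parent"]

# Each small chunk keeps a pointer to the larger section it came from.
chunks = [
    {"text": "revenue grew 12% in 2023",
     "parent": "Financial highlights: revenue grew 12% in 2023, driven by cloud services."},
    {"text": "headcount was reduced",
     "parent": "Operations: headcount was reduced across support functions."},
]

print(retrieve_parent("What was revenue growth in 2023?", chunks))
```

In the real system the chunk search would be a vector index and the result list would then pass through an LLM reranking stage before answering.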
Quick Start & Requirements
git clone https://github.com/IlyaRice/RAG-Challenge-2.git
cd RAG-Challenge-2
python -m venv venv
venv\Scripts\Activate.ps1  (Windows)
pip install -e . -r requirements.txt

Create a .env file and add your API keys.

A small test dataset (data/test_set/) and the full competition dataset (data/erc2_set/) are included; refer to the dataset-specific READMEs for details.

The pipeline can be run from src/pipeline.py or via main.py CLI commands (e.g., python main.py parse-pdfs).
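The structured output prompting mentioned above can be sketched as below: the model is asked to emit JSON containing an explicit reasoning field before its final answer, which the pipeline then validates. This is a hedged illustration only; the prompt wording, field names, and helper functions are hypothetical and not taken from this repository.

```python
import json

# Hypothetical schema hint asking for chain-of-thought reasoning plus a final answer.
SCHEMA_HINT = (
    'Respond with JSON only: {"reasoning": "<step-by-step thinking>", '
    '"answer": "<final answer>"}'
)

def build_prompt(question: str, context: str) -> str:
    """Assemble the retrieved context, the question, and the output-format hint."""
    return f"Context:\n{context}\n\nQuestion: {question}\n\n{SCHEMA_HINT}"

def parse_answer(raw: str) -> dict:
    """Parse the model's JSON reply and require both fields to be present."""
    data = json.loads(raw)
    if not {"reasoning", "answer"} <= data.keys():
        raise ValueError("missing required fields")
    return data

# Simulated model output (a real call would go to an LLM API):
raw = '{"reasoning": "The report states revenue of $1.2B.", "answer": "$1.2B"}'
print(parse_answer(raw)["answer"])
```

Requiring a reasoning field before the answer gives the model room for chain-of-thought while keeping the final output machine-parseable.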
Maintenance & Community
The project is presented as competition code with "rough edges and weird workarounds." It lacks tests and has minimal error handling, indicating it is not production-ready. No specific community channels or roadmap are mentioned.
Limitations & Caveats
This code is described as "scrappy" and not production-ready, featuring rough edges, workarounds, no tests, and minimal error handling. IBM Watson integration is non-functional as it was competition-specific. Users must provide their own API keys.