RAG system for enterprise document question answering
Top 25.1% on sourcepulse
This repository offers the winning solution for the Enterprise RAG Challenge 2, designed for researchers and practitioners exploring advanced RAG techniques. It provides a robust system for question answering on company annual reports, achieving state-of-the-art results through a combination of custom PDF parsing, vector search with parent document retrieval, LLM reranking, and structured output prompting.
How It Works
The system employs a multi-stage RAG pipeline. It begins with custom PDF parsing using Docling, followed by vector search enhanced with parent document retrieval to improve context relevance. A crucial step involves LLM reranking to further refine the retrieved context. Finally, it utilizes structured output prompting with chain-of-thought reasoning and query routing for complex comparisons, aiming for accurate and contextually rich answers.
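The parent document retrieval step above can be sketched as follows: small chunks are indexed for search, but each keeps a pointer to its larger parent section, which is what gets returned as context. This is a minimal illustration with a toy bag-of-words overlap standing in for vector similarity; the function and variable names are hypothetical, not taken from this repository.

```python
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Toy bag-of-words overlap score, standing in for vector similarity."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    return float(sum((wa & wb).values()))

def retrieve_parent(query: str, chunks: list) -> str:
    """Search over small chunks, but return the full parent section of the best hit."""
    best = max(chunks, key=lambda c: similarity(query, c["text"]))
    return best["parent"]

# Each small chunk keeps a pointer to the larger section it came from.
chunks = [
    {"text": "revenue grew 12% in 2023",
     "parent": "Financial highlights: revenue grew 12% in 2023, driven by cloud services."},
    {"text": "headcount was reduced",
     "parent": "Operations: headcount was reduced across support functions."},
]

print(retrieve_parent("What was revenue growth in 2023?", chunks))
```

In the real system the chunk search would be a vector index and the result list would then pass through an LLM reranking stage before answering.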
Quick Start & Requirements
git clone https://github.com/IlyaRice/RAG-Challenge-2.git
cd RAG-Challenge-2
python -m venv venv
venv\Scripts\Activate.ps1  (Windows)
pip install -e . -r requirements.txt

Create a .env file and add your API keys.

A small test dataset (data/test_set/) and the full competition dataset (data/erc2_set/) are included; refer to the dataset-specific READMEs for details.

The pipeline can be run from src/pipeline.py or via main.py CLI commands (e.g., python main.py parse-pdfs).
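The structured output prompting mentioned above can be sketched as below: the model is asked to emit JSON containing an explicit reasoning field before its final answer, which the pipeline then validates. This is a hedged illustration only; the prompt wording, field names, and helper functions are hypothetical and not taken from this repository.

```python
import json

# Hypothetical schema hint asking for chain-of-thought reasoning plus a final answer.
SCHEMA_HINT = (
    'Respond with JSON only: {"reasoning": "<step-by-step thinking>", '
    '"answer": "<final answer>"}'
)

def build_prompt(question: str, context: str) -> str:
    """Assemble the retrieved context, the question, and the output-format hint."""
    return f"Context:\n{context}\n\nQuestion: {question}\n\n{SCHEMA_HINT}"

def parse_answer(raw: str) -> dict:
    """Parse the model's JSON reply and require both fields to be present."""
    data = json.loads(raw)
    if not {"reasoning", "answer"} <= data.keys():
        raise ValueError("missing required fields")
    return data

# Simulated model output (a real call would go to an LLM API):
raw = '{"reasoning": "The report states revenue of $1.2B.", "answer": "$1.2B"}'
print(parse_answer(raw)["answer"])
```

Requiring a reasoning field before the answer gives the model room for chain-of-thought while keeping the final output machine-parseable.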
Maintenance & Community
The project is presented as competition code with "rough edges and weird workarounds." It lacks tests and has minimal error handling, indicating it is not production-ready. No specific community channels or roadmap are mentioned.
Limitations & Caveats
This code is described as "scrappy" and not production-ready, featuring rough edges, workarounds, no tests, and minimal error handling. IBM Watson integration is non-functional as it was competition-specific. Users must provide their own API keys.