enterprise-rag-challenge by trustbit

Evaluating RAG system accuracy on enterprise data

Created 1 year ago
259 stars

Top 97.9% on SourcePulse

Project Summary

Enterprise RAG Challenge

This repository hosts a challenge designed to rigorously test the accuracy and hallucination rates of Retrieval Augmented Generation (RAG) systems when processing enterprise data, specifically public annual reports. It targets developers, researchers, and power users building or evaluating LLM-driven assistants, offering a quantifiable benchmark for system performance in real-world information retrieval tasks. The primary benefit is objective comparison of RAG approaches against a standardized, verifiable dataset and question set.

How It Works

The challenge centers on participants deploying RAG systems capable of ingesting a provided corpus of public annual reports and answering a predefined set of questions. Accuracy is paramount, with a focus on detecting factual correctness and minimizing hallucinations. The system employs a unique, verifiable approach to ensure fairness: random seeds are generated using public blockchain APIs, and a deterministic question generator uses these seeds to create identical question sets for all participants, guaranteeing a level playing field.
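
A minimal sketch of what such a deterministic generator could look like, assuming Python's random.Random and illustrative question templates; the repository's actual generator (main.py) will differ:

```python
import random

def generate_questions(seed: int, companies: list[str], n: int = 5) -> list[str]:
    # The same seed yields the same question set for every participant.
    rng = random.Random(seed)
    templates = [
        "What was the total revenue of {company} in the last fiscal year?",
        "How many employees did {company} report at year end?",
        "Did {company} report a net profit or a net loss?",
    ]
    return [rng.choice(templates).format(company=rng.choice(companies))
            for _ in range(n)]

# Identical inputs reproduce identical questions across all participants.
questions = generate_questions(seed=20240101, companies=["ExampleCorp", "SampleCo"])
```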

Quick Start & Requirements

  • Repository Size: ~1.4GB (due to sample PDFs). The full dataset comprises ~46GB of annual reports.
  • Core Functionality: Requires a system capable of ingesting PDF documents and responding to questions via an API or UI.
  • Dependencies: Python 3 is required to run the provided scripts (gen_seed.py, main.py). Hardware and software dependencies beyond a standard Python environment are not detailed, but a RAG implementation will typically need additional components (e.g., a vector database and LLM API access).
  • Dataset: Participants can upload a test set of public annual reports. A comprehensive list of 7496 files (~46GB) is described, though only subsets are provided initially.
  • Verification: Scripts are provided for generating reproducible seeds and sampling datasets/questions (a seed-derivation sketch follows this list).

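As referenced in the Verification item above, one plausible way to turn a public block hash into a reproducible integer seed; the function, placeholder hash, and 32-bit reduction are illustrative assumptions, not the actual logic of gen_seed.py:

```python
def seed_from_block_hash(block_hash: str) -> int:
    # Interpret the hex block hash as an integer and fold it into 32 bits.
    return int(block_hash, 16) % (2**32)

# Placeholder hash for illustration only; the real hash would be fetched
# from a public blockchain API at the announced cut-off time, making the
# seed unpredictable in advance yet verifiable by anyone afterwards.
seed = seed_from_block_hash("00000000000000000001a2b3c4d5e6f7" * 2)
```
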
Highlighted Details

  • Dataset Scale: Utilizes a large corpus of 7496 public annual reports (~46GB).
  • Verifiable Randomness: Employs blockchain APIs for unpredictable, yet reproducible, seed generation.
  • Deterministic Question Generation: Ensures all participants face the exact same questions for a given seed, enabling direct comparison.
  • Hallucination Focus: Explicitly penalizes systems that answer when the information is not present in the source documents; such questions require an "N/A" response (a toy scoring sketch follows this list).
  • Submission: Handled via a provided API and user interface, incorporating verifiable TSP signatures.
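
A toy illustration of the hallucination penalty described above; the challenge's actual scoring rubric is not given here, so the weights and function signature are hypothetical:

```python
from typing import Optional

def score_answer(predicted: str, truth: Optional[str]) -> float:
    # truth is None when the source documents do not contain the answer;
    # the only correct response in that case is "N/A".
    if truth is None:
        return 1.0 if predicted.strip().upper() == "N/A" else -1.0  # hallucination penalized
    return 1.0 if predicted.strip() == truth else 0.0
```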

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmaps are present in the provided README.

Licensing & Compatibility

The repository is released under the Apache 2.0 license. This permissive license allows use, modification, and distribution, including for commercial purposes and integration into closed-source projects, provided attribution is given and the license terms are followed.

Limitations & Caveats

The full ~46GB dataset is not provided upfront, so participants may need to gather and manage large data volumes themselves. The challenge measures only accuracy and hallucination rates, excluding other important RAG performance metrics such as latency, cost, and throughput. Some questions may be intentionally nonsensical or refer to companies absent from the provided documents, specifically to test hallucination handling.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 15 stars in the last 30 days
