RAGChecker by amazon-science

RAG evaluation framework for diagnosing RAG systems

created 1 year ago
946 stars

Top 39.6% on sourcepulse

Project Summary

RAGChecker is an open-source framework for the fine-grained evaluation and diagnosis of Retrieval-Augmented Generation (RAG) systems. It offers a comprehensive suite of metrics for both overall pipeline assessment and detailed analysis of retriever and generator components, empowering developers and researchers to pinpoint and address performance bottlenecks.

How It Works

RAGChecker employs claim-level entailment operations for granular evaluation, breaking down RAG performance into specific aspects like faithfulness, context utilization, and hallucination. It leverages large language models (LLMs) as "extractors" and "checkers" to analyze query-response pairs and retrieved contexts, providing diagnostic metrics that offer deeper insights than traditional end-to-end evaluations.
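
To make the claim-level approach concrete, here is a minimal, illustrative sketch of how such metrics can be computed. This is not RAGChecker's implementation: extract_claims and entails are hypothetical stand-ins for the LLM-backed extractor and checker, and the two metrics shown only approximate the definitions used in the paper.

    from typing import List

    def extract_claims(text: str) -> List[str]:
        """Stand-in for the LLM 'extractor': split text into atomic factual claims."""
        raise NotImplementedError

    def entails(premise: str, claim: str) -> bool:
        """Stand-in for the LLM 'checker': does the premise entail the claim?"""
        raise NotImplementedError

    def claim_precision(response: str, gt_answer: str) -> float:
        """Fraction of response claims supported by the ground-truth answer."""
        claims = extract_claims(response)
        return sum(entails(gt_answer, c) for c in claims) / len(claims) if claims else 0.0

    def faithfulness(response: str, retrieved_chunks: List[str]) -> float:
        """Fraction of response claims supported by at least one retrieved chunk."""
        claims = extract_claims(response)
        supported = sum(any(entails(chunk, c) for chunk in retrieved_chunks) for c in claims)
        return supported / len(claims) if claims else 0.0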

Quick Start & Requirements

  • Install via pip: pip install ragchecker
  • Requires spaCy model: python -m spacy download en_core_web_sm
  • CLI usage: ragchecker-cli --input_path=<your_data.json> --output_path=<output.json> --extractor_name=<extractor_model> --checker_name=<checker_model> --metrics all_metrics
  • Python API available for programmatic use; a sketch follows this list.
  • Example input format and output metrics are provided in the repository.
  • Integration with LlamaIndex is available.
  • Official paper and tutorial links are provided.
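
The snippet below sketches the Python API as documented in the repository: load results from JSON, configure the extractor and checker models, and compute all metrics. The input record and the Bedrock model identifiers are illustrative; substitute whichever litellm-compatible models your environment provides.

    import json
    from ragchecker import RAGResults, RAGChecker
    from ragchecker.metrics import all_metrics

    # One evaluation record: query, ground-truth answer, generated response,
    # and the retrieved chunks seen by the generator.
    data = {
        "results": [
            {
                "query_id": "0",
                "query": "What is the longest river in the world?",
                "gt_answer": "The Nile is traditionally considered the longest river in the world.",
                "response": "The longest river in the world is the Nile.",
                "retrieved_context": [
                    {"doc_id": "doc_1", "text": "The Nile is about 6,650 km long."}
                ],
            }
        ]
    }
    rag_results = RAGResults.from_json(json.dumps(data))

    # Model names are illustrative (litellm-style identifiers); swap in your own.
    evaluator = RAGChecker(
        extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
        checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    )
    evaluator.evaluate(rag_results, all_metrics)
    print(rag_results)  # overall, retriever, and generator metrics

The same evaluation can be run from the command line with ragchecker-cli, as shown above.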

Highlighted Details

  • Holistic and diagnostic metrics for RAG pipeline analysis.
  • Fine-grained evaluation using claim-level entailment.
  • Includes a benchmark dataset (4k questions, 10 domains) and a meta-evaluation dataset.
  • Presented at the NeurIPS 2024 Datasets and Benchmarks Track.

Maintenance & Community

  • Project is actively developed by Amazon Science.
  • Paper published on arXiv (2408.08067).
  • Contribution guidelines are available.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source systems.

Limitations & Caveats

Evaluation quality depends on the LLMs chosen as extractor and checker. The quick start references a specific configuration (e.g., Llama3 70B via AWS Bedrock), so results may vary with the model provider and version, and access to a supported LLM provider is required.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 83 stars in the last 90 days
