RAGChecker by amazon-science

RAG evaluation framework for diagnosing RAG systems

Created 1 year ago
988 stars

Top 37.6% on SourcePulse

Project Summary

RAGChecker is an open-source framework for the fine-grained evaluation and diagnosis of Retrieval-Augmented Generation (RAG) systems. It offers a comprehensive suite of metrics for both overall pipeline assessment and detailed analysis of retriever and generator components, empowering developers and researchers to pinpoint and address performance bottlenecks.

How It Works

RAGChecker employs claim-level entailment operations for granular evaluation, breaking down RAG performance into specific aspects like faithfulness, context utilization, and hallucination. It leverages large language models (LLMs) as "extractors" and "checkers" to analyze query-response pairs and retrieved contexts, providing diagnostic metrics that offer deeper insights than traditional end-to-end evaluations.
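To make the claim-level idea concrete, the sketch below shows how precision, recall, and F1 fall out of entailment judgments once a response and a ground-truth answer have been decomposed into claims. This is an illustration of the approach described above, not RAGChecker's internal code; the claim lists and the `entails` stub (standing in for the LLM "checker") are hypothetical.

```python
from typing import Callable, List


def claim_level_scores(
    response_claims: List[str],
    gt_answer_claims: List[str],
    entails: Callable[[str, str], bool],
) -> dict:
    """Illustrative claim-level precision/recall/F1.

    `entails(premise, claim)` stands in for an LLM checker:
    it returns True if the premise text supports the claim.
    """
    # Precision: fraction of response claims supported by the ground-truth answer.
    supported = [c for c in response_claims if entails(" ".join(gt_answer_claims), c)]
    precision = len(supported) / max(len(response_claims), 1)

    # Recall: fraction of ground-truth claims covered by the response.
    covered = [c for c in gt_answer_claims if entails(" ".join(response_claims), c)]
    recall = len(covered) / max(len(gt_answer_claims), 1)

    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The same entailment primitive, applied between claims and the retrieved context, yields the retriever- and generator-level diagnostics (e.g., faithfulness, context utilization, hallucination).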

Quick Start & Requirements

  • Install via pip: pip install ragchecker
  • Requires spaCy model: python -m spacy download en_core_web_sm
  • CLI usage: ragchecker-cli --input_path=<your_data.json> --output_path=<output.json> --extractor_name=<extractor_model> --checker_name=<checker_model> --metrics all_metrics
  • Python API available for programmatic use (see the sketch following this list).
  • Example input format and output metrics are documented in the repository.
  • Integration with LlamaIndex is available.
  • Official paper and tutorial links are provided.
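For reference, the sketch below outlines the expected input shape and the programmatic entry point, following the README example. The model identifier `bedrock/meta.llama3-70b-instruct-v1:0` is a placeholder for whichever extractor/checker LLMs you have access to, and exact keyword arguments may differ across versions; treat this as a starting point rather than a definitive recipe.

```python
import json

from ragchecker import RAGChecker, RAGResults
from ragchecker.metrics import all_metrics

# Minimal input: one entry per query, with the gold answer, the system's
# response, and the retrieved chunks (field names follow the README example).
data = {
    "results": [
        {
            "query_id": "0",
            "query": "What is RAGChecker?",
            "gt_answer": "RAGChecker is a fine-grained evaluation framework for RAG systems.",
            "response": "RAGChecker evaluates RAG pipelines with claim-level metrics.",
            "retrieved_context": [
                {"doc_id": "doc_1", "text": "RAGChecker provides diagnostic metrics for RAG ..."}
            ],
        }
    ]
}

rag_results = RAGResults.from_json(json.dumps(data))

# Extractor and checker model names are placeholders; swap in the LLMs you use.
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)

# Computes overall, retriever, and generator metrics and attaches them to the results.
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)
```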

Highlighted Details

  • Holistic and diagnostic metrics for RAG pipeline analysis.
  • Fine-grained evaluation using claim-level entailment.
  • Includes a benchmark dataset (4k questions, 10 domains) and a meta-evaluation dataset.
  • Accepted at the NeurIPS Datasets and Benchmarks Track.

Maintenance & Community

  • Developed by Amazon Science (see Health Check below for recent activity).
  • Paper published on arXiv (2408.08067).
  • Contribution guidelines are available.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source systems.

Limitations & Caveats

Evaluation quality depends on the LLMs chosen as extractor and checker. The quick start documents a specific configuration (Llama 3 70B via AWS Bedrock), so results may vary across model providers and versions, and access to a sufficiently capable LLM endpoint is required.
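If Bedrock is not available, the extractor and checker can be pointed at another provider. The snippet below is a hypothetical configuration that assumes the model names are resolved LiteLLM-style (as the Bedrock identifier in the quick start suggests); verify the supported model-name format and required credentials against the current README.

```python
import os

from ragchecker import RAGChecker

# Assumption: an OpenAI-hosted model can stand in for the Bedrock default
# when the corresponding API key is set. Model name is illustrative only.
os.environ["OPENAI_API_KEY"] = "<your-api-key>"

evaluator = RAGChecker(
    extractor_name="openai/gpt-4o-mini",
    checker_name="openai/gpt-4o-mini",
)
```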

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 30 days
