RAGChecker by amazon-science

RAG evaluation framework for diagnosing RAG systems

created 1 year ago
946 stars

Top 39.6% on sourcepulse

Project Summary

RAGChecker is an open-source framework for the fine-grained evaluation and diagnosis of Retrieval-Augmented Generation (RAG) systems. It offers a comprehensive suite of metrics for both overall pipeline assessment and detailed analysis of retriever and generator components, empowering developers and researchers to pinpoint and address performance bottlenecks.

How It Works

RAGChecker employs claim-level entailment operations for granular evaluation, breaking down RAG performance into specific aspects like faithfulness, context utilization, and hallucination. It leverages large language models (LLMs) as "extractors" and "checkers" to analyze query-response pairs and retrieved contexts, providing diagnostic metrics that offer deeper insights than traditional end-to-end evaluations.
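
To make the claim-level approach concrete, here is a minimal, illustrative sketch of how such metrics can be computed. This is not RAGChecker's implementation: extract_claims and entails are hypothetical stand-ins for the LLM-backed extractor and checker, and the two metrics shown only approximate the definitions used in the paper.

    from typing import List

    def extract_claims(text: str) -> List[str]:
        """Stand-in for the LLM 'extractor': split text into atomic factual claims."""
        raise NotImplementedError

    def entails(premise: str, claim: str) -> bool:
        """Stand-in for the LLM 'checker': does the premise entail the claim?"""
        raise NotImplementedError

    def claim_precision(response: str, gt_answer: str) -> float:
        """Fraction of response claims supported by the ground-truth answer."""
        claims = extract_claims(response)
        return sum(entails(gt_answer, c) for c in claims) / len(claims) if claims else 0.0

    def faithfulness(response: str, retrieved_chunks: List[str]) -> float:
        """Fraction of response claims supported by at least one retrieved chunk."""
        claims = extract_claims(response)
        supported = sum(any(entails(chunk, c) for chunk in retrieved_chunks) for c in claims)
        return supported / len(claims) if claims else 0.0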

Quick Start & Requirements

  • Install via pip: pip install ragchecker
  • Requires spaCy model: python -m spacy download en_core_web_sm
  • CLI usage: ragchecker-cli --input_path=<your_data.json> --output_path=<output.json> --extractor_name=<extractor_model> --checker_name=<checker_model> --metrics all_metrics
  • Python API available for programmatic use; a sketch follows this list.
  • Example input format and output metrics are provided in the repository.
  • Integration with LlamaIndex is available.
  • Official paper and tutorial links are provided.
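
The snippet below sketches the Python API as documented in the repository: load results from JSON, configure the extractor and checker models, and compute all metrics. The input record and the Bedrock model identifiers are illustrative; substitute whichever litellm-compatible models your environment provides.

    import json
    from ragchecker import RAGResults, RAGChecker
    from ragchecker.metrics import all_metrics

    # One evaluation record: query, ground-truth answer, generated response,
    # and the retrieved chunks seen by the generator.
    data = {
        "results": [
            {
                "query_id": "0",
                "query": "What is the longest river in the world?",
                "gt_answer": "The Nile is traditionally considered the longest river in the world.",
                "response": "The longest river in the world is the Nile.",
                "retrieved_context": [
                    {"doc_id": "doc_1", "text": "The Nile is about 6,650 km long."}
                ],
            }
        ]
    }
    rag_results = RAGResults.from_json(json.dumps(data))

    # Model names are illustrative (litellm-style identifiers); swap in your own.
    evaluator = RAGChecker(
        extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
        checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    )
    evaluator.evaluate(rag_results, all_metrics)
    print(rag_results)  # overall, retriever, and generator metrics

The same evaluation can be run from the command line with ragchecker-cli, as shown above.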

Highlighted Details

  • Holistic and diagnostic metrics for RAG pipeline analysis.
  • Fine-grained evaluation using claim-level entailment.
  • Includes a benchmark dataset (4k questions, 10 domains) and a meta-evaluation dataset.
  • Presented at the NeurIPS 2024 Datasets and Benchmarks Track.

Maintenance & Community

  • Project is actively developed by Amazon Science.
  • Paper published on arXiv (2408.08067).
  • Contribution guidelines are available.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source systems.

Limitations & Caveats

Evaluation quality depends on the LLMs chosen as extractor and checker. The quick start references a specific configuration (e.g., Llama3 70B via AWS Bedrock), so results may vary with the model provider and version, and access to a supported LLM provider is required.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 83 stars in the last 90 days
