ai-rag-chat-evaluator by Azure-Samples

Evaluation tool for RAG chat applications

created 1 year ago
295 stars

Top 90.7% on sourcepulse

Project Summary

This repository provides tools for evaluating Retrieval-Augmented Generation (RAG) chat applications, specifically targeting developers and researchers aiming to improve response quality. It offers a framework for running systematic evaluations, measuring key performance indicators, and comparing different configurations of RAG systems.

How It Works

The core of the project is the evaltools Python package, which orchestrates evaluations by sending test questions to a target chat application. It uses the Azure AI Evaluation SDK and OpenAI GPT models to compute metrics such as groundedness, relevance, and coherence. Evaluations are configured via JSON files that specify the test data, the target application endpoint, and the desired metrics. The system supports both built-in GPT-based metrics and custom metrics, including code-based ones such as latency and answer length.
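
For illustration, the sketch below builds such a configuration as a Python dictionary and writes it to JSON. The key names are hypothetical assumptions, not the package's documented schema; the repository's example configs are the authoritative reference.

    # Hypothetical sketch of an evaluation config; key names are illustrative,
    # not the package's documented schema -- consult the repo's example configs.
    import json

    config = {
        "testdata_path": "example_input/qa.jsonl",        # questions with ground-truth answers
        "results_dir": "example_results/experiment1",     # where metric outputs are written
        "target_url": "http://localhost:50505/chat",      # endpoint of the RAG chat app under test
        "target_parameters": {"overrides": {"top": 3}},   # parameters forwarded to the target app
        "requested_metrics": ["groundedness", "relevance", "coherence",
                              "latency", "answer_length"],
    }

    with open("eval_config.json", "w") as f:
        json.dump(config, f, indent=2)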

Quick Start & Requirements

  • Install via python -m pip install -e . within a Python 3.10+ virtual environment.
  • Requires an Azure OpenAI or OpenAI.com instance with a deployed GPT-4 model for evaluation metrics.
  • Setup involves configuring environment variables (via a .env file) with Azure/OpenAI credentials and deployment details; a sketch of the expected variables follows this list.
  • Official documentation and examples are available within the repository.
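
As referenced in the setup bullet above, the following sketch checks for the environment variables an Azure OpenAI evaluation setup typically needs. The variable names are illustrative assumptions; the project's README documents the exact names it reads.

    # Illustrative check for the environment variables an Azure OpenAI evaluation
    # setup typically needs; the exact names used by this project may differ.
    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # read values from a local .env file

    required = [
        "AZURE_OPENAI_ENDPOINT",         # e.g. https://<resource>.openai.azure.com
        "AZURE_OPENAI_KEY",              # or use keyless auth via Azure credentials
        "AZURE_OPENAI_EVAL_DEPLOYMENT",  # name of the deployed GPT-4 model
    ]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")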

Highlighted Details

  • Supports evaluation of RAG chat apps against custom datasets.
  • Includes metrics for measuring the model's ability to correctly respond with "I don't know" when the requested information is not present in the source data.
  • Offers tools for reviewing and comparing evaluation results across different runs.
  • Allows customization of target application parameters and response parsing via JMESPath expressions (see the parsing sketch after this list).
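
To show what JMESPath-based response parsing looks like in practice, here is a small sketch using the jmespath package. The response shape and the path expression are hypothetical and depend entirely on the target application's schema.

    # Hypothetical example of extracting the answer text from a target app's
    # JSON response with a JMESPath expression (pip install jmespath).
    import jmespath

    response = {
        "message": {"content": "Contoso's return window is 30 days."},
        "context": {"data_points": ["policy.pdf#page=2"]},
    }

    # The expression below is an example; configure it to match your app's schema.
    answer = jmespath.search("message.content", response)
    print(answer)  # -> "Contoso's return window is 30 days."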

Maintenance & Community

The project lives in the Azure-Samples GitHub organization, indicating official Microsoft backing. Specific contributor details and community channels (such as Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository's license is not specified in the provided README. Compatibility for commercial use or closed-source linking would depend on the actual license.

Limitations & Caveats

Built-in GPT metrics are designed primarily for English-language answers; evaluating responses in other languages requires custom prompt metrics. Running evaluations can also be costly, since the GPT models consume tokens for every answer they judge.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

