ai-rag-chat-evaluator by Azure-Samples

Evaluation tool for RAG chat applications

created 1 year ago
295 stars

Top 90.7% on sourcepulse

Project Summary

This repository provides tools for evaluating Retrieval-Augmented Generation (RAG) chat applications, specifically targeting developers and researchers aiming to improve response quality. It offers a framework for running systematic evaluations, measuring key performance indicators, and comparing different configurations of RAG systems.

How It Works

The core of the project is the evaltools Python package, which orchestrates evaluations by sending test questions to a target chat application. It uses the Azure AI Evaluation SDK and OpenAI GPT models to compute metrics such as groundedness, relevance, and coherence. Evaluations are configured via JSON files that specify the test data, the target application endpoint, and the desired metrics. The system supports both built-in GPT-based metrics and custom metrics, including code-based ones such as latency and answer length.
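
For illustration, the sketch below builds such a configuration as a Python dictionary and writes it to JSON. The key names are hypothetical assumptions, not the package's documented schema; the repository's example configs are the authoritative reference.

    # Hypothetical sketch of an evaluation config; key names are illustrative,
    # not the package's documented schema -- consult the repo's example configs.
    import json

    config = {
        "testdata_path": "example_input/qa.jsonl",        # questions with ground-truth answers
        "results_dir": "example_results/experiment1",     # where metric outputs are written
        "target_url": "http://localhost:50505/chat",      # endpoint of the RAG chat app under test
        "target_parameters": {"overrides": {"top": 3}},   # parameters forwarded to the target app
        "requested_metrics": ["groundedness", "relevance", "coherence",
                              "latency", "answer_length"],
    }

    with open("eval_config.json", "w") as f:
        json.dump(config, f, indent=2)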

Quick Start & Requirements

  • Install via python -m pip install -e . within a Python 3.10+ virtual environment.
  • Requires an Azure OpenAI or OpenAI.com instance with a deployed GPT-4 model for evaluation metrics.
  • Setup involves configuring environment variables (via a .env file) with Azure/OpenAI credentials and deployment details; a sketch of the expected variables follows this list.
  • Official documentation and examples are available within the repository.
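
As referenced in the setup bullet above, the following sketch checks for the environment variables an Azure OpenAI evaluation setup typically needs. The variable names are illustrative assumptions; the project's README documents the exact names it reads.

    # Illustrative check for the environment variables an Azure OpenAI evaluation
    # setup typically needs; the exact names used by this project may differ.
    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # read values from a local .env file

    required = [
        "AZURE_OPENAI_ENDPOINT",         # e.g. https://<resource>.openai.azure.com
        "AZURE_OPENAI_KEY",              # or use keyless auth via Azure credentials
        "AZURE_OPENAI_EVAL_DEPLOYMENT",  # name of the deployed GPT-4 model
    ]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")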

Highlighted Details

  • Supports evaluation of RAG chat apps against custom datasets.
  • Includes metrics for measuring the model's ability to correctly respond with "I don't know" when the requested information is not present in the source data.
  • Offers tools for reviewing and comparing evaluation results across different runs.
  • Allows customization of target application parameters and response parsing via JMESPath expressions (see the parsing sketch after this list).
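
To show what JMESPath-based response parsing looks like in practice, here is a small sketch using the jmespath package. The response shape and the path expression are hypothetical and depend entirely on the target application's schema.

    # Hypothetical example of extracting the answer text from a target app's
    # JSON response with a JMESPath expression (pip install jmespath).
    import jmespath

    response = {
        "message": {"content": "Contoso's return window is 30 days."},
        "context": {"data_points": ["policy.pdf#page=2"]},
    }

    # The expression below is an example; configure it to match your app's schema.
    answer = jmespath.search("message.content", response)
    print(answer)  # -> "Contoso's return window is 30 days."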

Maintenance & Community

The project lives in the Azure-Samples GitHub organization, indicating official Microsoft backing. Specific contributor details and community channels (such as Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository's license is not specified in the provided README. Compatibility for commercial use or closed-source linking would depend on the actual license.

Limitations & Caveats

Built-in GPT metrics are designed primarily for English-language answers; evaluating responses in other languages requires custom prompt metrics. Running evaluations can also be costly, since the GPT models consume tokens for every answer they judge.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

