Evaluation tool for RAG chat applications
This repository provides tools for evaluating Retrieval-Augmented Generation (RAG) chat applications, specifically targeting developers and researchers aiming to improve response quality. It offers a framework for running systematic evaluations, measuring key performance indicators, and comparing different configurations of RAG systems.
How It Works
The core of the project is the evaltools Python package, which orchestrates evaluations by interacting with a target chat application. It leverages the Azure AI Evaluation SDK and OpenAI models to generate metrics such as groundedness, relevance, and coherence. Evaluations are configured via JSON files specifying test data, target application endpoints, and desired metrics. The system supports both built-in GPT-based metrics and custom metrics, including code-based ones like latency and answer length.
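To illustrate what a code-based custom metric can look like, the sketch below computes latency and answer length with plain Python functions. The function names and the response shape are assumptions for illustration only, not the evaltools package's actual API.

# Illustrative code-based metrics (latency, answer length); not the evaltools API.
import time
from typing import Callable, Dict

def answer_length(response: Dict) -> Dict:
    # Character count of the generated answer; assumes an "answer" field in the response.
    return {"answer_length": len(response.get("answer", ""))}

def latency(call_target: Callable[[str], Dict], question: str) -> Dict:
    # Wall-clock time for one round trip to the target chat endpoint.
    start = time.perf_counter()
    response = call_target(question)
    elapsed = time.perf_counter() - start
    return {"latency": round(elapsed, 3), **answer_length(response)}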
Quick Start & Requirements
Install the package with python -m pip install -e . within a Python 3.10+ virtual environment.
Set up configuration (in a .env file) with Azure/OpenAI credentials and deployment details.
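A typical end-to-end session might look like the following; the evaluate subcommand and the example_config.json file name are assumptions based on the project's sample configuration, so verify the exact invocation against the repository's README.

# Assumed invocation; confirm the subcommand and config file name in the repo README.
python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
python -m evaltools evaluate --config=example_config.json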
Maintenance & Community
The project is part of Azure Samples, indicating official Microsoft backing. Specific contributor details or community channels (like Discord/Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The repository's license is not specified in the provided README. Compatibility for commercial use or closed-source linking would depend on the actual license.
Limitations & Caveats
Built-in GPT metrics are primarily intended for English language answers. Evaluating non-English responses requires using custom prompt metrics. The cost of running evaluations can be significant due to token usage by GPT models.