ARES by stanford-futuredata

RAG evaluation framework

created 1 year ago
634 stars

Top 53.3% on sourcepulse

Project Summary

ARES is an automated framework for evaluating Retrieval-Augmented Generation (RAG) systems, designed for researchers and developers. It automates the assessment of context relevance, answer faithfulness, and answer relevance by combining synthetic data generation with fine-tuned classifiers, significantly reducing the need for manual annotation.

How It Works

ARES combines synthetic data generation with Prediction-Powered Inference (PPI). Classifiers are fine-tuned on synthetically generated queries and answers and then used to score RAG outputs. PPI uses a small human-annotated set to correct the classifiers' predictions on the larger unlabeled set, so each score comes with a statistical confidence interval even when model responses vary. The framework is model-agnostic, enabling evaluation of custom RAG pipelines.
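To make the PPI step concrete, here is a minimal sketch of the general prediction-powered estimator for a mean (for example, the fraction of answers judged faithful). It is not ARES's own implementation: the function name `ppi_mean_interval`, the 0/1 score encoding, and the normal-approximation interval are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_interval(labeled_preds, labeled_truth, unlabeled_preds, alpha=0.05):
    """Prediction-powered estimate and confidence interval for a success rate.

    labeled_preds   : classifier scores (0/1) on the human-annotated examples
    labeled_truth   : human labels (0/1) for those same examples
    unlabeled_preds : classifier scores (0/1) on the large unlabeled set
    (Hypothetical helper; names and encoding are assumptions.)
    """
    if len(labeled_preds) != len(labeled_truth):
        raise ValueError("labeled predictions and labels must align")
    if len(labeled_preds) < 2 or len(unlabeled_preds) < 2:
        raise ValueError("need at least two examples in each set")

    n, big_n = len(labeled_preds), len(unlabeled_preds)

    # Rectifier: average classifier error measured on the annotated set.
    errors = [p - y for p, y in zip(labeled_preds, labeled_truth)]
    rectifier = mean(errors)

    # Classifier-only estimate on the unlabeled set, corrected by the rectifier.
    estimate = mean(unlabeled_preds) - rectifier

    # Normal-approximation interval combining both sources of sampling noise.
    std_err = math.sqrt(variance(unlabeled_preds) / big_n + variance(errors) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Toy data: 4 annotated examples, 8 unlabeled ones.
est, low, high = ppi_mean_interval([1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0, 0, 1, 1, 0])
print(f"estimate={est:.3f}, interval=({low:.3f}, {high:.3f})")
```

With so few examples the interval is very wide; it narrows as the annotated and unlabeled sets grow, which is why a larger unlabeled set is useful even when annotations are scarce.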

Quick Start & Requirements

  • Installation: pip install ares-ai
  • API Keys: Set OPENAI_API_KEY or TOGETHER_API_KEY environment variables.
  • Requirements: A human preference validation set (at least 50 examples, ideally several hundred), few-shot examples for scoring, and a larger set of unlabeled query-document-answer triples from the RAG system.
  • Datasets: Example datasets can be downloaded using wget commands provided in the README. The full NQ dataset (37.3 GB) can be fetched via ares.KILT_dataset("nq").
  • Documentation: https://github.com/stanford-futuredata/ARES#documentation
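The prerequisites above can be checked before starting a run. The following is a small sketch using only the Python standard library; `preflight_problems` is a hypothetical helper, not part of the ares-ai package, and the 50-example threshold is taken from the lower bound stated in the requirements. If only local models are used, the API key check may not apply.

```python
import os

# Environment variables named in the Quick Start section.
API_KEY_VARS = ("OPENAI_API_KEY", "TOGETHER_API_KEY")
# Lower bound on the human preference validation set, per the requirements.
MIN_VALIDATION_EXAMPLES = 50


def preflight_problems(env, num_validation_examples):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # At least one of the API keys must be set to a non-empty value.
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("set one of: " + ", ".join(API_KEY_VARS))
    # The annotated validation set must meet the minimum size.
    if num_validation_examples < MIN_VALIDATION_EXAMPLES:
        problems.append(
            f"validation set has {num_validation_examples} examples, "
            f"need at least {MIN_VALIDATION_EXAMPLES}"
        )
    return problems


# Check the current process environment against a 120-example validation set.
for problem in preflight_problems(os.environ, num_validation_examples=120):
    print("not ready:", problem)
```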

Highlighted Details

  • Supports local model execution via vLLM for enhanced privacy and offline capabilities.
  • Provides tools for synthetic query generation and classifier training.
  • Offers direct comparison of RAG configurations and evaluation against ground truth.
  • Includes example configurations for UES/IDP scoring, PPI evaluation, and classifier training.

Maintenance & Community

Licensing & Compatibility

  • The README does not explicitly state a license. The project is hosted under the stanford-futuredata GitHub organization, which suggests a research origin, but specific license terms are not detailed there.

Limitations & Caveats

The framework requires significant computational resources, including over 100 GB of disk space and powerful GPUs (A100 recommended). Smaller GPUs may encounter CUDA out-of-memory errors. Setup on cloud VMs requires manual installation of Conda, GCC, and NVIDIA drivers.
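A quick way to catch the disk-space requirement early is to check free space before downloading datasets or model checkpoints. This is a sketch using the standard library's `shutil.disk_usage`; the helper names and the default of 100 GB (taken from the figure above) are assumptions, and it does not check GPU memory.

```python
import shutil


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in gigabytes (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path=".", required_gb=100.0):
    """True if the filesystem at `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


if not has_enough_disk("."):
    print(f"only {free_disk_gb('.'):.1f} GB free; over 100 GB is recommended")
```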

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 90 days
