tonic_validate by TonicAI

LLM/RAG evaluation framework

created 1 year ago
315 stars

Top 86.9% on sourcepulse

Project Summary

Tonic Validate is an open-source framework for evaluating the quality of responses from Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) applications. Aimed at developers and researchers building LLM-powered systems, it offers a suite of metrics that assess answer correctness, retrieval relevance, and hallucination, along with an optional UI for visualizing results.

How It Works

The framework accepts benchmark data (questions and reference answers) and LLM outputs (generated answers and retrieved contexts). It then applies various metrics, many of which leverage an evaluator LLM (GPT-4 Turbo by default, configurable to models from OpenAI, Azure OpenAI, Gemini, Claude, Mistral, Cohere, Together AI, and AWS Bedrock) to score the quality of the LLM's response against the provided inputs. Users can either provide a callback function that captures LLM responses or manually log responses for scoring.
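
For concreteness, here is a minimal sketch of the callback flow based on the project's README, assuming the Benchmark and ValidateScorer API (names and defaults may vary between versions; the stubbed callback stands in for a real RAG pipeline):

```python
from tonic_validate import Benchmark, ValidateScorer

# A benchmark pairs each question with a reference answer.
benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris"],
)

# The callback returns the generated answer and the retrieved context
# it was grounded on; replace the stubbed values with a real RAG call.
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."],
    }

# With no arguments, the scorer uses the default evaluator (GPT-4 Turbo).
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)
print(run.overall_scores)
```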

Quick Start & Requirements

  • Install: pip install tonic-validate
  • Prerequisites: an OpenAI API key (or other supported LLM provider keys) set as environment variables (see the snippet after this list). Python 3.9+ is required for Gemini.
  • Documentation: Explore the docs »
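
As a sketch, the key can be supplied from Python before constructing a scorer (the variable below is OpenAI's; other providers read their own keys):

```python
import os

# The OpenAI-backed metrics read the key from the environment;
# set it before creating a ValidateScorer.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, use your own key
```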

Highlighted Details

  • Supports a wide range of LLM providers for metric evaluation.
  • Offers metrics such as Answer Similarity, Retrieval Precision, Augmentation Precision, Augmentation Accuracy, Answer Consistency, Latency, and Contains Text (see the sketch after this list).
  • Includes an optional, free UI for visualizing evaluation results and tracking performance over time.
  • Provides a GitHub Action for integrating evaluations into CI/CD pipelines.
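
A hedged sketch of scoring with a subset of those metrics, assuming the metric classes exposed under tonic_validate.metrics (check your installed version for the exact names):

```python
from tonic_validate import ValidateScorer
from tonic_validate.metrics import (
    AnswerConsistencyMetric,
    AnswerSimilarityMetric,
    RetrievalPrecisionMetric,
)

# Each LLM-assisted metric issues its own evaluator calls, so scoring
# only the metrics you need keeps cost and latency down.
scorer = ValidateScorer([
    AnswerSimilarityMetric(),
    RetrievalPrecisionMetric(),
    AnswerConsistencyMetric(),
])
```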

Maintenance & Community

  • Active development with contributions encouraged via pull requests and issues.
  • Community support via GitHub issues.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

  • Metric scoring relies on external LLM APIs, incurring potential costs and latency.
  • Telemetry is collected by default; users can opt out by setting the TONIC_VALIDATE_DO_NOT_TRACK environment variable, as shown below.
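
For example, telemetry can be disabled by exporting the variable before the library is used (the variable name is from the project docs; "True" as the value is an assumption based on them):

```python
import os

# Opt out of telemetry before importing/using tonic_validate.
os.environ["TONIC_VALIDATE_DO_NOT_TRACK"] = "True"
```
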
Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jerry Liu (cofounder of LlamaIndex).

deepeval by confident-ai

LLM evaluation framework for unit testing LLM outputs

  • Top 2.0% on sourcepulse
  • 10k stars
  • created 2 years ago
  • updated 11 hours ago