continuous-eval by relari-ai

Open-source package for data-driven LLM application evaluation

Created 1 year ago
505 stars

Top 61.7% on SourcePulse

Project Summary

This package provides a data-driven evaluation framework for LLM-powered applications, enabling granular assessment of individual pipeline modules and end-to-end performance. It is designed for developers and researchers building and optimizing LLM applications, offering a comprehensive library of deterministic, semantic, and LLM-based metrics for various use cases like RAG, code generation, and agent tool use.
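
For a single module, metrics can be applied directly to individual records. Below is a minimal sketch of scoring one retrieval record with a deterministic metric; the import path, metric name, and field names are recalled from the project's documentation and may differ in the installed version.

    # Hedged sketch: score one retrieval record with a deterministic metric.
    # Import path, metric name, and field names are assumptions based on the
    # docs and may differ in the installed version of continuous-eval.
    from continuous_eval.metrics.retrieval import PrecisionRecallF1

    datum = {
        "retrieved_context": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
        "ground_truth_context": ["Paris is the capital of France."],
    }

    metric = PrecisionRecallF1()
    print(metric(**datum))  # e.g. precision/recall/F1 over the retrieved chunks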

How It Works

The framework supports both single-module and multi-module pipeline evaluations. Users define pipelines as a sequence of modules, each with specific inputs, outputs, and associated metrics. The EvaluationRunner orchestrates the execution of these metrics against provided datasets, allowing for modular assessment and the identification of performance bottlenecks. It supports custom metrics, including LLM-as-a-Judge implementations, for tailored evaluation needs.
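
As an illustration, the hedged sketch below wires a retriever module and a generator module into a pipeline and hands it to the EvaluationRunner. Apart from EvaluationRunner, which is named above, the class names (Dataset, Module, ModuleOutput, Pipeline), import paths, metric choices, and call signatures are assumptions modeled on the project's documentation and may not match the current API exactly.

    # Hedged sketch of a two-module pipeline evaluation. Everything except the
    # EvaluationRunner name (Dataset, Module, ModuleOutput, Pipeline, metric
    # classes, import paths, call signatures) is an assumption based on the
    # docs; consult the documentation for the exact API.
    from typing import List

    from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
    from continuous_eval.eval.runner import EvaluationRunner
    from continuous_eval.metrics.retrieval import PrecisionRecallF1
    from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness

    dataset = Dataset("data/eval_golden_dataset")  # hypothetical dataset path

    retriever = Module(
        name="retriever",
        input=dataset.question,
        output=List[str],
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=ModuleOutput(),
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    generator = Module(
        name="generator",
        input=retriever,
        output=str,
        eval=[
            DeterministicAnswerCorrectness().use(
                answer=ModuleOutput(),
                ground_truth_answers=dataset.ground_truth_answers,
            ),
        ],
    )

    pipeline = Pipeline([retriever, generator], dataset=dataset)

    # Run every metric attached to each module over the logged pipeline outputs.
    runner = EvaluationRunner(pipeline)
    results = runner.evaluate()  # the runner may instead take logged outputs as an argument
    print(results)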

Quick Start & Requirements

  • Install via pip: python3 -m pip install continuous-eval
  • For LLM-based metrics, an LLM API key (e.g., OpenAI) is required in a .env file; see the sketch after this list.
  • See Docs and Examples Repo.
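
LLM-based metrics read the provider key from the environment at runtime. A minimal sketch, assuming an OpenAI key stored under the conventional OPENAI_API_KEY name (the docs list the exact variables the library reads):

    # Hedged sketch: load the provider key from a local .env file before running
    # LLM-based metrics. OPENAI_API_KEY is an assumed variable name; check the
    # continuous-eval docs for the exact variables it expects.
    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads .env from the current working directory
    assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env to use LLM-based metrics"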

Highlighted Details

  • Supports modular evaluation of complex, multi-stage LLM pipelines.
  • Offers a diverse metric library covering RAG, code generation, agent tool use, and more.
  • Enables custom metric creation, including LLM-as-a-Judge patterns; see the sketch after this list.
  • Includes probabilistic evaluation capabilities.
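
To illustrate the LLM-as-a-Judge pattern noted above, the sketch below implements a standalone judge with the OpenAI Python client rather than continuous-eval's own custom-metric interface (see the docs for that); the model name, prompt, and 1-5 scale are illustrative choices.

    # Hedged, library-independent sketch of an LLM-as-a-Judge metric: an LLM
    # grades an answer against a reference and returns an integer score 1-5.
    # This illustrates the pattern only; it does not use continuous-eval's
    # custom-metric base classes. Model name and prompt are illustrative.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def llm_judge_correctness(question: str, answer: str, reference: str) -> int:
        prompt = (
            "Rate how well the answer matches the reference on a 1-5 scale. "
            "Reply with a single integer.\n\n"
            f"Question: {question}\nAnswer: {answer}\nReference: {reference}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return int(response.choices[0].message.content.strip())

    score = llm_judge_correctness(
        "What is the capital of France?",
        "Paris is the capital of France.",
        "Paris",
    )
    print(score)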

Maintenance & Community

  • Maintained project with a community Discord server for support and discussion, though recent activity is limited (see Health Check below).
  • Blog posts offer practical guides on RAG evaluation and GenAI app assessment.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The README notes that evaluation scripts must be run in a new process to avoid multiprocessing issues, so parallel execution needs some care in how the script is launched.
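
In Python, the usual way to satisfy a "run in a new process" requirement when a library spawns worker processes is to keep the evaluation call behind a main-module guard. A minimal sketch, assuming that is what the README's note refers to:

    # Hedged sketch: guard the entry point so worker processes spawned via
    # multiprocessing do not re-execute the script's top level. Assumes this
    # is the intent behind the README's "run in a new process" note.
    def run_evaluation():
        # Build the pipeline and invoke the EvaluationRunner here.
        ...

    if __name__ == "__main__":
        run_evaluation()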

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface
  • LLM evaluation toolkit for multiple backends
  • 2.6% · 2k stars · Created 1 year ago · Updated 1 day ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai
  • Evaluation harness for LLMs trained on code
  • 0.4% · 3k stars · Created 4 years ago · Updated 8 months ago

Starred by Luis Capelo (Cofounder of Lightning AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml
  • Open-source LLM evaluation framework for RAG, agents, and more
  • 1.7% · 14k stars · Created 2 years ago · Updated 12 hours ago