continuous-eval by relari-ai

Open-source package for data-driven LLM application evaluation

created 1 year ago
501 stars

Top 62.9% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This package provides a data-driven evaluation framework for LLM-powered applications, enabling granular assessment of individual pipeline modules as well as end-to-end performance. It targets developers and researchers building and optimizing LLM applications, and offers a library of deterministic, semantic, and LLM-based metrics covering use cases such as RAG, code generation, and agent tool use.

How It Works

The framework supports both single-module and multi-module pipeline evaluations. Users define pipelines as a sequence of modules, each with specific inputs, outputs, and associated metrics. The EvaluationRunner orchestrates the execution of these metrics against provided datasets, allowing for modular assessment and the identification of performance bottlenecks. It supports custom metrics, including LLM-as-a-Judge implementations, for tailored evaluation needs.
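
For illustration, here is a minimal sketch of the single-module case, following the single-metric usage pattern described in the project docs. The module path, metric class, and field names (continuous_eval.metrics.retrieval, PrecisionRecallF1, retrieved_context, ground_truth_context) are assumptions based on that pattern and may differ across versions.

```python
# Minimal sketch: score one retrieval step with a deterministic metric.
# Names follow the pattern in the continuous-eval docs but are illustrative;
# check the docs for the version you have installed.
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital and most populous city of France.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # prints a dict of precision/recall/F1 scores
```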

Quick Start & Requirements

  • Install via pip: python3 -m pip install continuous-eval
  • For LLM-based metrics, an LLM API key (e.g., OpenAI) is required in a .env file (see the sketch after this list).
  • See Docs and Examples Repo.
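
To make the API-key step concrete, here is a minimal loading sketch. The OPENAI_API_KEY variable name and the python-dotenv package are conventional assumptions, not requirements stated above.

```python
# Assumption: a .env file in the working directory holds the key needed by
# LLM-based metrics. OPENAI_API_KEY and python-dotenv are conventions here,
# not requirements stated in the summary -- check the continuous-eval docs.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in .env before running LLM-based metrics")
```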

Highlighted Details

  • Supports modular evaluation of complex, multi-stage LLM pipelines.
  • Offers a diverse metric library covering RAG, code generation, agent tool use, and more.
  • Enables custom metric creation, including LLM-as-a-Judge patterns (sketched after this list).
  • Includes probabilistic evaluation capabilities.
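
To make the custom-metric bullet concrete, here is a rough sketch of the LLM-as-a-Judge pattern. The class name JudgeCorrectness and the plain OpenAI chat call are hypothetical illustrations of the pattern, not continuous-eval's actual base classes or integration points.

```python
# Hypothetical LLM-as-a-Judge metric: illustrates the pattern only; it does not
# use continuous-eval's own metric base classes.
from openai import OpenAI


class JudgeCorrectness:
    """Asks an LLM to rate an answer against a reference on a 1-5 scale."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self._client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._model = model

    def __call__(self, question: str, answer: str, ground_truth: str) -> dict:
        prompt = (
            "Rate how well the answer matches the reference on a 1-5 scale. "
            "Reply with a single digit.\n"
            f"Question: {question}\nAnswer: {answer}\nReference: {ground_truth}"
        )
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return {"judge_score": int(response.choices[0].message.content.strip())}
```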

Maintenance & Community

  • Active development with a community Discord server available for support and discussion.
  • Blog posts offer practical guides on RAG evaluation and GenAI app assessment.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The README notes that the evaluation script should be run in a new process to avoid multiprocessing issues, which suggests some extra care is needed when setting up parallel execution (see the sketch below).
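
A minimal sketch of what that caveat usually amounts to in Python: keep the evaluation entry point behind a __main__ guard so worker processes can import the module without re-running it. The run_evaluation() helper below is a placeholder, not a continuous-eval API.

```python
# Standard Python multiprocessing hygiene, which the README's caveat points at:
# keep the evaluation entry point behind a __main__ guard.
def run_evaluation():
    # Placeholder: build the dataset/pipeline and invoke the EvaluationRunner here.
    ...


if __name__ == "__main__":
    run_evaluation()
```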

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jerry Liu (cofounder of LlamaIndex).

deepeval by confident-ai

LLM evaluation framework for unit testing LLM outputs
Top 2.0% · 10k stars
created 2 years ago · updated 16 hours ago