continuous-eval by relari-ai

Open-source package for data-driven LLM application evaluation

Created 1 year ago
505 stars

Top 61.7% on SourcePulse

Project Summary

This package provides a data-driven evaluation framework for LLM-powered applications, enabling granular assessment of individual pipeline modules and end-to-end performance. It is designed for developers and researchers building and optimizing LLM applications, offering a comprehensive library of deterministic, semantic, and LLM-based metrics for various use cases like RAG, code generation, and agent tool use.
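
For a single module, metrics can be applied directly to individual records. Below is a minimal sketch of scoring one retrieval record with a deterministic metric; the import path, metric name, and field names are recalled from the project's documentation and may differ in the installed version.

    # Hedged sketch: score one retrieval record with a deterministic metric.
    # Import path, metric name, and field names are assumptions based on the
    # docs and may differ in the installed version of continuous-eval.
    from continuous_eval.metrics.retrieval import PrecisionRecallF1

    datum = {
        "retrieved_context": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
        "ground_truth_context": ["Paris is the capital of France."],
    }

    metric = PrecisionRecallF1()
    print(metric(**datum))  # e.g. precision/recall/F1 over the retrieved chunks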

How It Works

The framework supports both single-module and multi-module pipeline evaluations. Users define pipelines as a sequence of modules, each with specific inputs, outputs, and associated metrics. The EvaluationRunner orchestrates the execution of these metrics against provided datasets, allowing for modular assessment and the identification of performance bottlenecks. It supports custom metrics, including LLM-as-a-Judge implementations, for tailored evaluation needs.
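
As an illustration, the hedged sketch below wires a retriever module and a generator module into a pipeline and hands it to the EvaluationRunner. Apart from EvaluationRunner, which is named above, the class names (Dataset, Module, ModuleOutput, Pipeline), import paths, metric choices, and call signatures are assumptions modeled on the project's documentation and may not match the current API exactly.

    # Hedged sketch of a two-module pipeline evaluation. Everything except the
    # EvaluationRunner name (Dataset, Module, ModuleOutput, Pipeline, metric
    # classes, import paths, call signatures) is an assumption based on the
    # docs; consult the documentation for the exact API.
    from typing import List

    from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
    from continuous_eval.eval.runner import EvaluationRunner
    from continuous_eval.metrics.retrieval import PrecisionRecallF1
    from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness

    dataset = Dataset("data/eval_golden_dataset")  # hypothetical dataset path

    retriever = Module(
        name="retriever",
        input=dataset.question,
        output=List[str],
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=ModuleOutput(),
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    generator = Module(
        name="generator",
        input=retriever,
        output=str,
        eval=[
            DeterministicAnswerCorrectness().use(
                answer=ModuleOutput(),
                ground_truth_answers=dataset.ground_truth_answers,
            ),
        ],
    )

    pipeline = Pipeline([retriever, generator], dataset=dataset)

    # Run every metric attached to each module over the logged pipeline outputs.
    runner = EvaluationRunner(pipeline)
    results = runner.evaluate()  # the runner may instead take logged outputs as an argument
    print(results)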

Quick Start & Requirements

  • Install via pip: python3 -m pip install continuous-eval
  • For LLM-based metrics, an LLM API key (e.g., OpenAI) is required in a .env file; see the sketch after this list.
  • See Docs and Examples Repo.
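
LLM-based metrics read the provider key from the environment at runtime. A minimal sketch, assuming an OpenAI key stored under the conventional OPENAI_API_KEY name (the docs list the exact variables the library reads):

    # Hedged sketch: load the provider key from a local .env file before running
    # LLM-based metrics. OPENAI_API_KEY is an assumed variable name; check the
    # continuous-eval docs for the exact variables it expects.
    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads .env from the current working directory
    assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env to use LLM-based metrics"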

Highlighted Details

  • Supports modular evaluation of complex, multi-stage LLM pipelines.
  • Offers a diverse metric library covering RAG, code generation, agent tool use, and more.
  • Enables custom metric creation, including LLM-as-a-Judge patterns; see the sketch after this list.
  • Includes probabilistic evaluation capabilities.
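
To illustrate the LLM-as-a-Judge pattern noted above, the sketch below implements a standalone judge with the OpenAI Python client rather than continuous-eval's own custom-metric interface (see the docs for that); the model name, prompt, and 1-5 scale are illustrative choices.

    # Hedged, library-independent sketch of an LLM-as-a-Judge metric: an LLM
    # grades an answer against a reference and returns an integer score 1-5.
    # This illustrates the pattern only; it does not use continuous-eval's
    # custom-metric base classes. Model name and prompt are illustrative.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def llm_judge_correctness(question: str, answer: str, reference: str) -> int:
        prompt = (
            "Rate how well the answer matches the reference on a 1-5 scale. "
            "Reply with a single integer.\n\n"
            f"Question: {question}\nAnswer: {answer}\nReference: {reference}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return int(response.choices[0].message.content.strip())

    score = llm_judge_correctness(
        "What is the capital of France?",
        "Paris is the capital of France.",
        "Paris",
    )
    print(score)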

Maintenance & Community

  • Maintained project with a community Discord server for support and discussion, though recent activity is limited (see Health Check below).
  • Blog posts offer practical guides on RAG evaluation and GenAI app assessment.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The README notes that evaluation scripts must be run in a new process to avoid multiprocessing issues, so parallel execution needs some care in how the script is launched.
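
In Python, the usual way to satisfy a "run in a new process" requirement when a library spawns worker processes is to keep the evaluation call behind a main-module guard. A minimal sketch, assuming that is what the README's note refers to:

    # Hedged sketch: guard the entry point so worker processes spawned via
    # multiprocessing do not re-execute the script's top level. Assumes this
    # is the intent behind the README's "run in a new process" note.
    def run_evaluation():
        # Build the pipeline and invoke the EvaluationRunner here.
        ...

    if __name__ == "__main__":
        run_evaluation()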

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface
  • LLM evaluation toolkit for multiple backends
  • 2.6% · 2k stars · Created 1 year ago · Updated 1 day ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai
  • Evaluation harness for LLMs trained on code
  • 0.4% · 3k stars · Created 4 years ago · Updated 8 months ago

Starred by Luis Capelo (Cofounder of Lightning AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml
  • Open-source LLM evaluation framework for RAG, agents, and more
  • 1.7% · 14k stars · Created 2 years ago · Updated 12 hours ago