Open-source package for data-driven LLM application evaluation
This package provides a data-driven evaluation framework for LLM-powered applications, enabling granular assessment of individual pipeline modules and end-to-end performance. It is designed for developers and researchers building and optimizing LLM applications, offering a comprehensive library of deterministic, semantic, and LLM-based metrics for various use cases like RAG, code generation, and agent tool use.
How It Works
The framework supports both single-module and multi-module pipeline evaluation. Users define a pipeline as a sequence of modules, each with its own inputs, outputs, and associated metrics. The EvaluationRunner orchestrates the execution of these metrics against provided datasets, enabling modular assessment and the identification of performance bottlenecks. Custom metrics, including LLM-as-a-Judge implementations, are also supported for tailored evaluation needs.
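As a rough sketch of that flow, a two-module RAG pipeline could be wired up as below. The import paths, the .use() wiring, and the runner call follow the Pipeline/Module/EvaluationRunner pattern described above, but exact names and signatures should be treated as assumptions and checked against the package documentation for your installed version.

# Sketch of a two-module pipeline with per-module metrics.
# Import paths, the dataset location, and the runner call are assumptions.
from typing import List

from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
from continuous_eval.eval.runner import EvaluationRunner
from continuous_eval.metrics.retrieval import PrecisionRecallF1
from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness

dataset = Dataset("data/eval_golden_dataset")  # hypothetical dataset path

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=List[str],
    eval=[
        # Compare retrieved chunks against the golden context for this module
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

generator = Module(
    name="generator",
    input=retriever,
    output=str,
    eval=[
        # Score the generated answer against ground-truth answers
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(),
            ground_truth_answers=dataset.ground_truths,
        ),
    ],
)

pipeline = Pipeline([retriever, generator], dataset=dataset)

# The runner executes every module's metrics over the dataset;
# the exact evaluate() signature is assumed here.
runner = EvaluationRunner(pipeline)
results = runner.evaluate()
print(results)

Because each module carries its own metrics, a regression in retrieval quality surfaces on the retriever module even when end-to-end answer correctness looks unchanged, which is what makes bottleneck identification possible.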
Quick Start & Requirements
python3 -m pip install continuous-eval
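For a first smoke test, a single deterministic metric can be computed on a plain dictionary of fields. This sketch uses the package's retrieval metrics; the field names mirror what those metrics typically expect, but treat them as illustrative rather than canonical.

# Minimal single-metric check; no pipeline or API keys needed.
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital and most populous city of France.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": [
        "Paris is the capital and most populous city of France.",
    ],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # precision, recall, and F1 over the retrieved chunks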
LLM-based metrics require model-provider API keys, which are supplied via a .env file.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that running the evaluation script in a new process is important to avoid multiprocessing issues, which suggests some extra care is needed when setting up parallel metric execution.
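In practice this corresponds to the standard Python guard for code that spawns worker processes: wrapping the evaluation entry point as below keeps child processes from re-executing the script when they import it. This is a general Python multiprocessing pattern, not an API specific to this package.

# Multiprocessing-safe entry point for an evaluation script.
def main():
    # Build the pipeline and run the evaluation here.
    pass

if __name__ == "__main__":
    # Only the parent process runs the evaluation; worker processes
    # spawned for parallel metric execution can import this module safely.
    main()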