evalyn by shihongDev

GenAI application evaluation framework

Created 2 months ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

GenAI applications require robust evaluation for continuous improvement. Evalyn addresses this with a lightweight, local-first evaluation framework designed for both developers and non-technical users. It simplifies tracing LLM calls, annotating human feedback, suggesting relevant metrics, and calibrating evaluation models, so GenAI app behavior can be understood and improved while all data stays on the user's machine.

How It Works

Evalyn employs a four-stage pipeline: Collect, Evaluate, Calibrate, and Expand. In the Collect phase, an @eval decorator automatically captures LLM interactions, logging traces to SQLite and generating datasets in JSONL format. The Evaluate stage involves suggesting relevant objective and LLM-based metrics from a large bank, then running the evaluation to produce reports. The Calibrate phase is crucial for aligning LLM judges with human feedback through automated prompt optimization techniques like GEPA, and clustering failures for deeper insights. Finally, the Expand stage uses simulation to generate synthetic queries, feeding back into the evaluation loop for continuous improvement. This iterative, local-first approach makes GenAI evaluation practical and accessible.
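The Collect stage's decorator-based tracing can be sketched roughly as follows. This is a minimal illustration only: the @eval name comes from the summary above, but the SQLite schema, field names, and JSONL export here are assumptions, not Evalyn's actual implementation.

```python
import functools
import json
import sqlite3
import time

# Hypothetical trace store: Evalyn's real schema is not shown in the summary.
DB = sqlite3.connect(":memory:")
DB.execute("CREATE TABLE traces (ts REAL, fn TEXT, inputs TEXT, output TEXT)")

def eval(fn):
    """Record each decorated call's inputs and output as a trace row.

    (Named `eval` to mirror the summary's @eval decorator; it shadows the
    builtin, which a real library would avoid.)
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        DB.execute(
            "INSERT INTO traces VALUES (?, ?, ?, ?)",
            (time.time(), fn.__name__,
             json.dumps({"args": args, "kwargs": kwargs}),
             json.dumps(result)),
        )
        return result
    return wrapper

@eval
def answer(question: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {question}"

answer("What is Evalyn?")

# Export captured traces as a JSONL dataset, one record per line.
rows = DB.execute("SELECT fn, inputs, output FROM traces").fetchall()
dataset = "\n".join(
    json.dumps({"fn": f, "inputs": i, "output": o}) for f, i, o in rows
)
```

The key idea is that tracing is a side effect of normal execution: wrapping the app's LLM entry points is all the instrumentation the Collect stage needs.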

Quick Start & Requirements

  • Installation: Install uv (e.g., curl -LsSf https://astral.sh/uv/install.sh | sh), create a Python 3.10+ virtual environment (uv venv --python 3.10), activate it, and then install the SDK (uv pip install -e "./sdk[llm]").
  • Prerequisites: Python 3.10+, uv. API keys (e.g., GEMINI_API_KEY, OPENAI_API_KEY) are required for LLM judges.
  • Setup: The README provides example agents and an evalyn one-click command for a streamlined workflow.
  • Links: Example agents are located in the example_agents/ directory.
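Collected as a single shell session, the installation steps above look like this (commands are taken from the Quick Start; the API-key export is an illustrative placeholder):

```shell
# Install uv, then create and activate a Python 3.10 virtual environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.10
source .venv/bin/activate

# Install the SDK with LLM extras.
uv pip install -e "./sdk[llm]"

# LLM judges require an API key, e.g.:
export GEMINI_API_KEY=...   # or OPENAI_API_KEY
```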

Highlighted Details

  • Fully Local: All data, including traces and datasets, is stored locally using SQLite, eliminating cloud dependencies and ensuring data privacy.
  • Extensive Metric Bank: Features over 130 built-in metrics (73 objective, 60 LLM judges), with community contributions actively encouraged.
  • Automated Calibration: Includes automatic prompt optimization (e.g., GEPA) to align LLM judges with human feedback, enhancing evaluation consistency and accuracy.
  • One-Click Pipeline: A single evalyn one-click command automates the entire data collection, metric suggestion, evaluation, and reporting process.
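Evalyn's calibration internals are not shown in this summary, but conceptually, aligning an LLM judge with human feedback means maximizing an agreement score over annotated traces. A rough illustration of that objective (all names hypothetical, not Evalyn's API):

```python
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the LLM judge matches human annotations.

    Calibration (e.g., prompt optimization such as GEPA) searches for a
    judge prompt that pushes this score up. This helper is an illustrative
    stand-in, not Evalyn's actual calibration objective.
    """
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Example: the judge agrees with humans on 3 of 4 annotated traces.
score = agreement([True, False, True, True], [True, False, False, True])
```

Disagreements flagged by a score like this are also the natural input for the failure clustering mentioned in the Calibrate stage.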

Maintenance & Community

Community contributions for new metrics are welcomed, with a detailed guide provided for implementing both objective and subjective metrics. Issues can be submitted via GitHub, and direct contact is available via email at lsh98dev@gmail.com. Example integrations for popular frameworks like LangChain are included.
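To picture the objective/subjective split the contribution guide covers, here is what a simple objective metric might look like. The signature is hypothetical; consult the project's metric guide for the real plugin interface.

```python
def exact_match(output: str, expected: str) -> float:
    """Objective metric sketch: 1.0 if the model output matches the
    reference exactly (ignoring surrounding whitespace), else 0.0.
    Illustrative only; not Evalyn's actual metric interface."""
    return 1.0 if output.strip() == expected.strip() else 0.0

score = exact_match("Paris", " Paris ")
```

Objective metrics like this are deterministic and free to run; subjective metrics delegate the judgment to an LLM judge and therefore incur the API costs and calibration needs noted under Limitations.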

Licensing & Compatibility

The project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The reliance on LLM judges for subjective metrics introduces potential costs associated with API calls and requires careful calibration to mitigate bias. While the framework aims for ease of use, advanced features like calibration and simulation may demand a deeper understanding of evaluation principles and potentially significant human annotation effort. The project specifies Python 3.10+ as a requirement.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
112 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 2 more.

YiVal by YiVal

0%
2k
Prompt engineering assistant for GenAI apps
Created 2 years ago
Updated 1 year ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (cofounder of Lightning AI), and 8 more.

lighteval by huggingface

0.2%
2k
LLM evaluation toolkit for multiple backends
Created 2 years ago
Updated 5 days ago