shihongDev: GenAI application evaluation framework
Top 99.6% on SourcePulse
GenAI applications require robust evaluation for continuous improvement. Evalyn addresses this with a lightweight, local-first evaluation framework designed for both developers and non-technical users. It simplifies tracing LLM calls, annotating human feedback, suggesting relevant metrics, and calibrating evaluation models, so that GenAI app behavior can be understood and improved effectively while all data stays on the user's machine.
How It Works
Evalyn employs a four-stage pipeline: Collect, Evaluate, Calibrate, and Expand. In the Collect phase, an @eval decorator automatically captures LLM interactions, logging traces to SQLite and generating datasets in JSONL format. The Evaluate stage involves suggesting relevant objective and LLM-based metrics from a large bank, then running the evaluation to produce reports. The Calibrate phase is crucial for aligning LLM judges with human feedback through automated prompt optimization techniques like GEPA, and clustering failures for deeper insights. Finally, the Expand stage uses simulation to generate synthetic queries, feeding back into the evaluation loop for continuous improvement. This iterative, local-first approach makes GenAI evaluation practical and accessible.
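The Collect stage described above can be sketched as a minimal tracing decorator. This is a hypothetical illustration of the pattern (the decorator name, schema, and file paths are assumptions, not Evalyn's actual @eval API): each call is logged to SQLite and mirrored as a JSONL dataset row.

```python
import functools
import json
import sqlite3
import time

def eval_trace(db_path="traces.db", dataset_path="dataset.jsonl"):
    """Hypothetical sketch of an @eval-style tracing decorator:
    records each call's input, output, and latency to SQLite and
    appends the same record to a JSONL dataset file."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            record = {
                "fn": fn.__name__,
                "input": repr((args, kwargs)),
                "output": repr(result),
                "latency_s": round(time.time() - start, 4),
            }
            con = sqlite3.connect(db_path)
            con.execute("CREATE TABLE IF NOT EXISTS traces (record TEXT)")
            con.execute("INSERT INTO traces VALUES (?)", (json.dumps(record),))
            con.commit()
            con.close()
            with open(dataset_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

@eval_trace()
def answer(question: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {question}"

print(answer("hello"))  # traced: logged to traces.db and dataset.jsonl
```

A real implementation would capture richer context (prompts, model parameters, token counts), but the shape of the loop — decorate, call, persist trace, emit dataset — matches the pipeline described above.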
Quick Start & Requirements
Install uv (e.g., curl -LsSf https://astral.sh/uv/install.sh | sh), create a Python 3.10+ virtual environment (uv venv --python 3.10), activate it, and then install the SDK (uv pip install -e "./sdk[llm]"). API keys (e.g., GEMINI_API_KEY, OPENAI_API_KEY) are required for LLM judges. The evalyn one-click command offers a streamlined workflow, and example agents are provided in the example_agents/ directory.
Highlighted Details
The evalyn one-click command automates the entire data collection, metric suggestion, evaluation, and reporting process.
Maintenance & Community
Community contributions for new metrics are welcomed, with a detailed guide provided for implementing both objective and subjective metrics. Issues can be submitted via GitHub, and direct contact is available via email at lsh98dev@gmail.com. Example integrations for popular frameworks like LangChain are included.
Licensing & Compatibility
The project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
The reliance on LLM judges for subjective metrics introduces potential costs associated with API calls and requires careful calibration to mitigate bias. While the framework aims for ease of use, advanced features like calibration and simulation may demand a deeper understanding of evaluation principles and potentially significant human annotation effort. The project specifies Python 3.10+ as a requirement.
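To make the calibration concern above concrete, judge-human alignment is commonly quantified by comparing an LLM judge's labels against human annotations, e.g. with raw agreement and chance-corrected Cohen's kappa. The sketch below uses invented labels and function names (not Evalyn's API) purely as an illustration:

```python
from collections import Counter

def agreement(judge, human):
    """Fraction of items where the LLM judge matches the human label."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Chance-corrected agreement between two categorical raters."""
    n = len(judge)
    po = agreement(judge, human)                      # observed agreement
    jc, hc = Counter(judge), Counter(human)
    pe = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical labels: 1 = pass, 0 = fail
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 0, 1, 0, 1, 1, 1]

print(agreement(judge_labels, human_labels))          # 0.75
print(round(cohens_kappa(judge_labels, human_labels), 3))  # 0.467
```

A low kappa despite high raw agreement signals that the judge agrees with humans mostly by chance, which is exactly the situation prompt-optimization techniques like GEPA aim to correct.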
Similar Projects
braintrustdata
groq
YiVal
huggingface
comet-ml