openevals by langchain-ai

Evaluation toolkit for LLM apps, like tests for traditional software

Created 7 months ago
726 stars

Top 47.5% on SourcePulse

Project Summary

OpenEvals provides a comprehensive suite of tools for evaluating Large Language Model (LLM) applications, aiming to bring the rigor of traditional software testing to LLM development. It offers a variety of pre-built evaluators for common tasks like correctness, conciseness, hallucination detection, and retrieval relevance, alongside capabilities for code evaluation and multi-turn conversation simulation. The library is designed for developers and researchers building and deploying LLM-powered applications, enabling them to systematically assess and improve their models' performance.

How It Works

The core of OpenEvals revolves around its create_llm_as_judge function, which leverages another LLM to act as a judge for evaluating outputs. This approach allows for flexible and customizable evaluation criteria by defining prompts that guide the judge LLM. Beyond LLM-as-judge, OpenEvals includes string-based evaluators (e.g., Levenshtein distance, embedding similarity), exact match evaluators for structured data, and specialized evaluators for code quality (type-checking, execution) often within sandboxed environments for safety.
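
The snippet below is a minimal sketch of that pattern, based on the project's README; exact import paths, model identifiers, and parameters may differ between versions. A correctness judge is built from a prebuilt prompt and then called like an ordinary function:

    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CORRECTNESS_PROMPT

    # Build an evaluator that asks a judge model to grade correctness.
    correctness_evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",  # any supported judge model
        feedback_key="correctness",
    )

    # The judge compares the app's output against a reference answer.
    eval_result = correctness_evaluator(
        inputs="How much has the price of doodads changed in the past year?",
        outputs="Doodads have increased in price by 10% in the past year.",
        reference_outputs="The price of doodads has decreased by 50% in the past year.",
    )
    print(eval_result)  # a dict with the feedback key, a score, and a comment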

Quick Start & Requirements

  • Installation: pip install openevals (Python) or npm install openevals @langchain/core (TypeScript).
  • Prerequisites: For LLM-as-judge evaluators, an OpenAI API key is required and should be set as the OPENAI_API_KEY environment variable. Additional dependencies such as langchain-openai or openai may be needed for specific model integrations. Sandboxed code evaluators require the e2b-code-interpreter package (pip install "openevals[e2b-code-interpreter]") and an E2B API key.
  • Usage: Import evaluators and run them with inputs, outputs, and optionally reference_outputs (see the sketch after this list).
  • Documentation: https://github.com/langchain-ai/openevals
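
For the simplest non-LLM case, here is a hedged sketch assuming the exact_match evaluator shipped with the library (the import path may vary by version): the evaluator is imported and called directly with outputs and reference_outputs, returning a score dict.

    from openevals.exact import exact_match

    # Structured outputs are compared field-for-field against the reference.
    outputs = {"a": 1, "b": 2}
    reference_outputs = {"a": 1, "b": 2}

    result = exact_match(outputs=outputs, reference_outputs=reference_outputs)
    print(result)  # e.g. a dict with a boolean score indicating an exact match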

Highlighted Details

  • LLM-as-Judge: Highly customizable evaluation using LLMs as judges with configurable prompts, models, and output schemas.
  • Code Evaluation: Includes type-checking (Pyright, Mypy, TypeScript) and execution evaluators, with sandboxed options for secure dependency management and execution.
  • Multi-turn Simulation: Tools to simulate conversations between an application and a user, evaluating the entire interaction trajectory.
  • LangSmith Integration: Seamless logging of evaluation results to LangSmith for experiment tracking and analysis (see the sketch after this list).
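
The sketch below shows one way the LangSmith integration could look, assuming the langsmith SDK's Client.evaluate accepts OpenEvals evaluators directly, as the README suggests. The target function and dataset name are hypothetical, and LangSmith credentials (LANGSMITH_API_KEY) must be configured for results to be logged.

    from langsmith import Client
    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CONCISENESS_PROMPT

    # Hypothetical target app: whatever produces the outputs being graded.
    def my_app(inputs: dict) -> dict:
        return {"answer": "A short reply to: " + inputs["question"]}

    conciseness_evaluator = create_llm_as_judge(
        prompt=CONCISENESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="conciseness",
    )

    client = Client()
    # Runs my_app over the named dataset and attaches evaluator feedback
    # to the resulting experiment in LangSmith.
    experiment_results = client.evaluate(
        my_app,
        data="my-dataset",  # hypothetical dataset name
        evaluators=[conciseness_evaluator],
    )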

Maintenance & Community

The project is actively maintained by the LangChain AI team. Community engagement is encouraged through GitHub issues and X (@LangChainAI).

Licensing & Compatibility

The project appears to be under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

While the toolkit is comprehensive, the effectiveness of LLM-as-judge evaluators depends on the quality of the prompts and the capabilities of the judge LLM. Sandboxed execution requires E2B setup and an API key. Some code evaluators may ignore specific error types (e.g., reportMissingImports).

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 6
  • Issues (30d): 1

Star History

44 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

Top 0.1% on SourcePulse · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

Top 0.4% on SourcePulse · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago · Updated 8 months ago