openevals by langchain-ai

Evaluation toolkit for LLM apps, like tests for traditional software

Created 7 months ago
726 stars

Top 47.5% on SourcePulse

Project Summary

OpenEvals provides a comprehensive suite of tools for evaluating Large Language Model (LLM) applications, aiming to bring the rigor of traditional software testing to LLM development. It offers a variety of pre-built evaluators for common tasks like correctness, conciseness, hallucination detection, and retrieval relevance, alongside capabilities for code evaluation and multi-turn conversation simulation. The library is designed for developers and researchers building and deploying LLM-powered applications, enabling them to systematically assess and improve their models' performance.

How It Works

The core of OpenEvals revolves around its create_llm_as_judge function, which leverages another LLM to act as a judge for evaluating outputs. This approach allows for flexible and customizable evaluation criteria by defining prompts that guide the judge LLM. Beyond LLM-as-judge, OpenEvals includes string-based evaluators (e.g., Levenshtein distance, embedding similarity), exact match evaluators for structured data, and specialized evaluators for code quality (type-checking, execution) often within sandboxed environments for safety.
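
The snippet below is a minimal sketch of that pattern, based on the project's README; exact import paths, model identifiers, and parameters may differ between versions. A correctness judge is built from a prebuilt prompt and then called like an ordinary function:

    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CORRECTNESS_PROMPT

    # Build an evaluator that asks a judge model to grade correctness.
    correctness_evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",  # any supported judge model
        feedback_key="correctness",
    )

    # The judge compares the app's output against a reference answer.
    eval_result = correctness_evaluator(
        inputs="How much has the price of doodads changed in the past year?",
        outputs="Doodads have increased in price by 10% in the past year.",
        reference_outputs="The price of doodads has decreased by 50% in the past year.",
    )
    print(eval_result)  # a dict with the feedback key, a score, and a comment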

Quick Start & Requirements

  • Installation: pip install openevals (Python) or npm install openevals @langchain/core (TypeScript).
  • Prerequisites: For LLM-as-judge evaluators, an OpenAI API key is required and should be set as the OPENAI_API_KEY environment variable. Additional dependencies such as langchain-openai or openai may be needed for specific model integrations. Sandboxed code evaluators require the e2b-code-interpreter package (pip install "openevals[e2b-code-interpreter]") and an E2B API key.
  • Usage: Import evaluators and run them with inputs, outputs, and optionally reference_outputs (see the sketch after this list).
  • Documentation: https://github.com/langchain-ai/openevals
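
For the simplest non-LLM case, here is a hedged sketch assuming the exact_match evaluator shipped with the library (the import path may vary by version): the evaluator is imported and called directly with outputs and reference_outputs, returning a score dict.

    from openevals.exact import exact_match

    # Structured outputs are compared field-for-field against the reference.
    outputs = {"a": 1, "b": 2}
    reference_outputs = {"a": 1, "b": 2}

    result = exact_match(outputs=outputs, reference_outputs=reference_outputs)
    print(result)  # e.g. a dict with a boolean score indicating an exact match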

Highlighted Details

  • LLM-as-Judge: Highly customizable evaluation using LLMs as judges with configurable prompts, models, and output schemas.
  • Code Evaluation: Includes type-checking (Pyright, Mypy, TypeScript) and execution evaluators, with sandboxed options for secure dependency management and execution.
  • Multi-turn Simulation: Tools to simulate conversations between an application and a user, evaluating the entire interaction trajectory.
  • LangSmith Integration: Seamless logging of evaluation results to LangSmith for experiment tracking and analysis (see the sketch after this list).
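
The sketch below shows one way the LangSmith integration could look, assuming the langsmith SDK's Client.evaluate accepts OpenEvals evaluators directly, as the README suggests. The target function and dataset name are hypothetical, and LangSmith credentials (LANGSMITH_API_KEY) must be configured for results to be logged.

    from langsmith import Client
    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CONCISENESS_PROMPT

    # Hypothetical target app: whatever produces the outputs being graded.
    def my_app(inputs: dict) -> dict:
        return {"answer": "A short reply to: " + inputs["question"]}

    conciseness_evaluator = create_llm_as_judge(
        prompt=CONCISENESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="conciseness",
    )

    client = Client()
    # Runs my_app over the named dataset and attaches evaluator feedback
    # to the resulting experiment in LangSmith.
    experiment_results = client.evaluate(
        my_app,
        data="my-dataset",  # hypothetical dataset name
        evaluators=[conciseness_evaluator],
    )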

Maintenance & Community

The project is actively maintained by the LangChain AI team. Community engagement is encouraged through GitHub issues and X (@LangChainAI).

Licensing & Compatibility

The project appears to be under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

While the toolkit is comprehensive, the effectiveness of LLM-as-judge evaluators depends on the quality of the prompts and the capabilities of the judge LLM. Sandboxed execution requires E2B setup and an API key. Some code evaluators may ignore specific error types (e.g., reportMissingImports).

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 6
  • Issues (30d): 1

Star History

44 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

Top 0.1% on SourcePulse · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

Top 0.4% on SourcePulse · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago · Updated 8 months ago