openevals by langchain-ai

Evaluation toolkit for LLM apps, like tests for traditional software

created 5 months ago
651 stars

Top 52.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

OpenEvals provides a comprehensive suite of tools for evaluating Large Language Model (LLM) applications, aiming to bring the rigor of traditional software testing to LLM development. It offers a variety of pre-built evaluators for common tasks like correctness, conciseness, hallucination detection, and retrieval relevance, alongside capabilities for code evaluation and multi-turn conversation simulation. The library is designed for developers and researchers building and deploying LLM-powered applications, enabling them to systematically assess and improve their models' performance.

How It Works

OpenEvals centers on its create_llm_as_judge function, which uses a second LLM as a judge to score an application's outputs. This makes evaluation criteria flexible and customizable: the judge LLM is guided by a prompt that defines what to assess. Beyond LLM-as-judge, OpenEvals includes string-based evaluators (e.g., Levenshtein distance, embedding similarity), exact-match evaluators for structured data, and specialized evaluators for code quality (type-checking, execution), often run in sandboxed environments for safety.
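A minimal sketch of that flow, based on the usage pattern in the repository README; the CORRECTNESS_PROMPT prebuilt prompt, the "openai:o3-mini" model string, and the exact result shape are assumptions that may differ across versions:

    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CORRECTNESS_PROMPT

    # Build an evaluator whose "judge" is an LLM guided by a prebuilt prompt.
    correctness_evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",  # judge model; requires OPENAI_API_KEY
        feedback_key="correctness",
    )

    # Score a single output against a reference answer.
    result = correctness_evaluator(
        inputs="How much has the price of doodads changed in the past year?",
        outputs="Doodads have increased in price by 10% in the past year.",
        reference_outputs="The price of doodads has decreased by 50% in the past year.",
    )
    print(result)  # e.g. {"key": "correctness", "score": False, "comment": "..."}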

Quick Start & Requirements

  • Installation: pip install openevals (Python) or npm install openevals @langchain/core (TypeScript).
  • Prerequisites: LLM-as-judge evaluators need an API key for the judge model; with the default OpenAI models, set the OPENAI_API_KEY environment variable. Additional dependencies such as langchain-openai or openai may be needed for specific model integrations. Sandboxed code evaluators require e2b-code-interpreter (pip install openevals["e2b-code-interpreter"]) and an E2B API key.
  • Usage: Import evaluators and call them with inputs, outputs, and optionally reference_outputs (see the sketch after this list).
  • Documentation: https://github.com/langchain-ai/openevals
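As a usage illustration for the non-LLM evaluators, here is a minimal sketch assuming the exact-match helper is importable from openevals.exact as in the README (verify import paths and the result shape against the current docs):

    from openevals.exact import exact_match

    outputs = {"answer": 42, "source": "faq"}
    reference_outputs = {"answer": 42, "source": "faq"}

    # Deterministic comparison of structured outputs; no API key needed.
    result = exact_match(outputs=outputs, reference_outputs=reference_outputs)
    print(result)  # e.g. {"key": "equal", "score": True}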

Highlighted Details

  • LLM-as-Judge: Highly customizable evaluation using LLMs as judges with configurable prompts, models, and output schemas.
  • Code Evaluation: Includes type-checking (Pyright, Mypy, TypeScript) and execution evaluators, with sandboxed options for secure dependency management and execution.
  • Multi-turn Simulation: Tools to simulate conversations between an application and a user, evaluating the entire interaction trajectory.
  • LangSmith Integration: Seamless logging of evaluation results to LangSmith for experiment tracking and analysis.
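A hedged sketch of the LangSmith integration: an openevals evaluator is wrapped in a function with the (inputs, outputs, reference_outputs) signature that LangSmith evaluators accept and passed to Client.evaluate. The dataset name and target function below are illustrative assumptions, not part of either library.

    from langsmith import Client

    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CORRECTNESS_PROMPT

    judge = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )

    def correctness(inputs: dict, outputs: dict, reference_outputs: dict):
        # Adapt the openevals evaluator to a LangSmith evaluator signature.
        return judge(inputs=inputs, outputs=outputs, reference_outputs=reference_outputs)

    def my_app(inputs: dict) -> dict:
        # Hypothetical target function; replace with your LLM application.
        return {"answer": "Paris is the capital of France."}

    client = Client()
    # Runs my_app over the named dataset and logs judge feedback to LangSmith.
    client.evaluate(
        my_app,
        data="example-dataset",  # hypothetical dataset name
        evaluators=[correctness],
    )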

Maintenance & Community

The project is actively maintained by the LangChain AI team. Community engagement is encouraged through GitHub issues and X (@LangChainAI).

Licensing & Compatibility

The project appears to be under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

While the toolkit is comprehensive, LLM-as-judge evaluators are only as good as the prompts and the judge LLM behind them. Sandboxed execution requires an E2B account, API key, and setup. Some code evaluators may ignore specific error types (e.g., reportMissingImports).

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 303 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Wei-Lin Chiang (cofounder of LMArena).

evalplus by evalplus

LLM code evaluation framework for rigorous testing
2k stars, Top 0.5% on sourcepulse
created 2 years ago, updated 4 weeks ago