LLM evaluation framework for unit testing LLM outputs
DeepEval is an open-source LLM evaluation framework for developers and researchers building LLM applications. It provides a Pytest-like experience for unit testing LLM outputs, with research-backed metrics that assess qualities such as hallucination, answer relevancy, and RAG performance, so teams can iterate on and deploy LLM systems with confidence.
How It Works
DeepEval uses a modular design that lets users choose from a wide array of pre-built metrics or define custom ones. Metrics can be powered by LLMs, statistical methods, or local NLP models. The framework supports both Pytest integration for CI/CD pipelines and standalone evaluation for notebook environments, making it straightforward to test LLM responses systematically against defined criteria.
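For standalone (notebook-style) evaluation, the typical flow is: build a test case, pick metrics, and run them. The following is a minimal sketch assuming DeepEval's documented `evaluate()` API, an `OPENAI_API_KEY` in the environment, and illustrative input/output strings; metric names and signatures may vary across versions.

```python
# Standalone evaluation sketch (no Pytest required), based on DeepEval's
# documented evaluate() entry point; exact signatures may differ by version.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Wrap one interaction with your LLM app as a test case.
test_case = LLMTestCase(
    input="What is DeepEval?",
    actual_output="DeepEval is an open-source framework for evaluating LLM outputs.",
    retrieval_context=["DeepEval is an open-source LLM evaluation framework."],
)

# Run relevancy and RAG faithfulness checks against the test case.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```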
Quick Start & Requirements
pip install -U deepeval
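After installation, a Pytest-style test might look like the sketch below, which assumes DeepEval's documented `assert_test()` helper and `LLMTestCase`; LLM-backed metrics such as `AnswerRelevancyMetric` also expect an `OPENAI_API_KEY` (or a configured custom model), and the example strings are placeholders for your own application's output.

```python
# test_llm_app.py — Pytest-style unit test for an LLM output.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace with the actual output produced by your LLM application.
        actual_output="You can return them within 30 days for a full refund.",
    )
    # Fails the test if the relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

This is typically run with `deepeval test run test_llm_app.py`, or with plain `pytest`, depending on your setup.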
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework relies on external LLM APIs (such as OpenAI's) for many metrics, which can incur costs and requires API key management. Some advanced features, such as DAG-based custom metrics, are still under development.
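As a hedged illustration of the key-management point, the default OpenAI-backed metrics typically read the key from the standard environment variable:

```python
# Assumption: LLM-backed metrics use the OpenAI client, which reads
# OPENAI_API_KEY from the environment. Set it before running evaluations.
import os
os.environ["OPENAI_API_KEY"] = "<your-api-key>"  # prefer a secrets manager or CI secret in practice
```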