benchllm by v7labs

CI for LLM applications

Created 2 years ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

BenchLLM is an open-source Python library for continuous integration and rigorous testing of LLM-powered applications, agents, and chains. It systematically validates model responses against expected outputs, helping developers build confidence in their LLM code and catch inaccuracies or hallucinations early.

How It Works

The library employs a two-step methodology: first, a "Testing" phase captures model predictions for given inputs without immediate judgment. Second, an "Evaluation" phase uses LLMs (defaulting to OpenAI's GPT-3) or other methods to compare these predictions against predefined expected responses, generating detailed reports. This separation allows for granular control and comprehensive performance analysis.
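
The separation of the two phases can be sketched in a few lines of plain Python. This is an illustrative, self-contained mock-up of the test-then-evaluate idea, not BenchLLM's actual API; all names here are hypothetical, and the "judge" is a trivial string match standing in for an LLM-based evaluator:

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """A captured model response, recorded without judgment."""
    test_input: str
    expected: list[str]
    output: str


def run_tests(model, tests):
    """Phase 1 (Testing): call the model and capture raw predictions."""
    return [Prediction(t["input"], t["expected"], model(t["input"])) for t in tests]


def evaluate(predictions, judge):
    """Phase 2 (Evaluation): compare each captured prediction to its expected answers."""
    return [judge(p.output, p.expected) for p in predictions]


# Trivial stand-ins for a real model and a real (LLM-based) judge
model = lambda q: "2" if q == "What is 1 + 1?" else "unknown"
string_match = lambda output, expected: output in expected

tests = [{"input": "What is 1 + 1?", "expected": ["2", "two"]}]
predictions = run_tests(model, tests)   # phase 1: no judgment yet
results = evaluate(predictions, string_match)  # phase 2: grade the captured outputs
print(results)  # [True]
```

Because predictions are captured before any grading happens, the same prediction set can be re-evaluated with a different judge (semantic, embedding, string match) without re-running the model.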

Quick Start & Requirements

Installation is straightforward via pip: pip install benchllm. The default semantic evaluation requires an OPENAI_API_KEY environment variable. To initiate testing, use the bench run command, optionally specifying target files or folders. BenchLLM is developed for Python 3.10 and recommends pip >= 23. Links to GitHub for contributions and Discord/Twitter for support are available.
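
The commands above can be combined into a short setup sequence. The install, environment variable, and bench run commands come from the project's documentation; the target folder path is a hypothetical placeholder:

```shell
# Install BenchLLM (developed for Python 3.10; pip >= 23 recommended)
pip install benchllm

# The default semantic evaluator calls the OpenAI API
export OPENAI_API_KEY="sk-..."

# Run all tests, or point at a specific file or folder
bench run
bench run ./my_tests  # hypothetical path
```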

Highlighted Details

  • Flexible Evaluation: Supports multiple evaluators including semantic (LLM-based), embedding (cosine distance), string matching, interactive manual checks, and a web UI.
  • Caching: Implements caching mechanisms (memory, file default, none) to accelerate repeated evaluations.
  • Function Mocking: Allows mocking of external function calls within LLM chains or agents to ensure test predictability and discover unexpected interactions.
  • API Access: Provides a programmatic API using Test, Tester, and Evaluator objects for advanced control beyond CLI commands.
  • Parallelism: Evaluation jobs can be run in parallel using the --workers N parameter.
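
To make the embedding evaluator concrete, here is a hand-rolled sketch of cosine-similarity matching. This is not BenchLLM's implementation; the function names and threshold are illustrative, and the toy 3-dimensional vectors stand in for real embeddings produced by an embedding model:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_match(pred_vec, expected_vecs, threshold=0.9):
    """Pass if the prediction's embedding is close enough to any expected embedding."""
    return any(cosine_similarity(pred_vec, v) >= threshold for v in expected_vecs)


# Toy "embeddings"; a real evaluator would embed the model output and expected answers
prediction = [0.9, 0.1, 0.0]
expected = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(embedding_match(prediction, expected))  # True
```

An embedding-based check like this tolerates paraphrases that a strict string match would reject, at the cost of an extra embedding call per comparison.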

Maintenance & Community

BenchLLM is actively used internally at V7 and is open-sourced under the MIT license. The project is noted to be in an early stage of development with potential for rapid changes. Contributions are welcomed via GitHub issues and pull requests, following PEP8 guidelines. Community support is available on Discord and Twitter.

Licensing & Compatibility

The project is released under the permissive MIT License, allowing for broad compatibility with commercial and closed-source applications.

Limitations & Caveats

BenchLLM is explicitly stated to be in the early stages of development, implying potential for breaking changes and evolving features. The default semantic evaluation relies on OpenAI's API, requiring an API key and incurring associated costs.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Meng Zhang (cofounder of TabbyML), and 3 more.

qodo-cover by qodo-ai

0.1%
5k
CLI tool for AI-powered test generation and code coverage enhancement
Created 1 year ago
Updated 4 months ago
Starred by Luis Capelo (cofounder of Lightning AI), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml

1.2%
15k
Open-source LLM evaluation framework for RAG, agents, and more
Created 2 years ago
Updated 15 hours ago