shield by pegasi-ai

AI testing framework for LLM output validation

Created 2 years ago
355 stars

Top 78.5% on SourcePulse

1 Expert Loves This Project
Project Summary

Feather is a lightweight framework for statistical testing and validation of LLM outputs and behaviors, designed for AI developers and researchers. It enables the creation of comprehensive test suites, automated evaluations, and behavioral checks to ensure AI application reliability and adherence to requirements.

How It Works

Feather focuses on statistical testing, evaluations with quantitative and qualitative metrics, and simple safety validations. Together, these provide a repeatable, quantifiable way to assess model behavior and output quality and to keep AI applications consistent and correct.
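To make the statistical-testing idea concrete, the sketch below shows the general pattern such a framework automates: sample the model repeatedly on the same prompt, score each response with a binary check, and run a binomial test against a required pass rate. This is plain Python for illustration only; generate and contains_required_disclaimer are hypothetical placeholders, not part of the pegasi-ai API.

    # Illustrative statistical check on an LLM behavior (not the pegasi-ai API).
    # `generate(prompt) -> str` is a user-supplied call to the model under test.
    from scipy.stats import binomtest

    def contains_required_disclaimer(text: str) -> bool:
        # Binary property the model is expected to satisfy (placeholder check).
        return "not financial advice" in text.lower()

    def test_disclaimer_rate(generate, prompt, n=50, required_rate=0.95):
        passes = sum(contains_required_disclaimer(generate(prompt)) for _ in range(n))
        # One-sided binomial test: is the observed pass rate significantly
        # below the required rate?
        result = binomtest(passes, n, p=required_rate, alternative="less")
        assert result.pvalue > 0.05, (
            f"Pass rate {passes}/{n} is significantly below {required_rate:.0%}"
        )

Because individual LLM runs are noisy, a test like this treats each behavioral requirement as a hypothesis about a pass rate rather than a single pass/fail call.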

Quick Start & Requirements

  • Install via pip install pegasi-ai.
  • Requires an API key from app.pegasi.ai (a minimal configuration sketch follows this list).
  • See the Evals notebook for a quick start.
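The API key from app.pegasi.ai is typically supplied at runtime; one common pattern is an environment variable, sketched below. PEGASI_API_KEY is an assumed name, not confirmed by the README; the Evals notebook shows the actual configuration.

    # Quick-start configuration sketch (the environment variable name is an assumption).
    import os

    api_key = os.environ.get("PEGASI_API_KEY")  # key issued at app.pegasi.ai
    if api_key is None:
        raise RuntimeError("Set PEGASI_API_KEY before running the evaluation suite.")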

Highlighted Details

  • Provides a comprehensive testing suite for model behavior validation.
  • Supports quantitative and qualitative metrics for performance measurement.
  • Includes simple safety checks and output validation capabilities (an illustrative sketch follows this list).
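As a rough illustration of what a simple safety and output-validation check can look like, the sketch below flags a few obvious violations in a model response. The patterns and helper name are placeholders, not the project's built-in validators or judges.

    import re

    # Illustrative output validator -- placeholder logic, not the pegasi-ai judges.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def validate_output(text: str) -> list[str]:
        """Return a list of violation labels found in a model response."""
        violations = []
        if EMAIL_RE.search(text):
            violations.append("possible email address leaked")
        if SSN_RE.search(text):
            violations.append("possible SSN leaked")
        if not text.strip():
            violations.append("empty response")
        return violations

    # Usage: fail a test case if any violation is detected.
    assert validate_output("Contact me at alice@example.com") == ["possible email address leaked"]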

Maintenance & Community

The project ships with established AI validators and out-of-the-box judges. Future plans include distribution-based testing, expanded statistical validation tools, improved test result visualization, custom test case creation, and community-driven test suites.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The framework is currently under active development, with features like distribution-based testing, advanced statistical validation, and custom test case creation still on the roadmap.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

1k stars
Top 0.2% on SourcePulse
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

3k stars
Top 0.4% on SourcePulse
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago