shield by pegasi-ai

AI testing framework for LLM output validation

Created 2 years ago
355 stars

Top 78.5% on SourcePulse

1 Expert Loves This Project
Project Summary

Feather is a lightweight framework for statistical testing and validation of LLM outputs and behaviors, designed for AI developers and researchers. It enables the creation of comprehensive test suites, automated evaluations, and behavioral checks to ensure AI application reliability and adherence to requirements.

How It Works

Feather focuses on statistical testing, evaluations with quantitative and qualitative metrics, and simple safety validations. Together, these provide a repeatable, quantifiable way to assess model behavior and output quality and to keep AI applications consistent and correct.
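To make the statistical-testing idea concrete, the sketch below shows the general pattern such a framework automates: sample the model repeatedly on the same prompt, score each response with a binary check, and run a binomial test against a required pass rate. This is plain Python for illustration only; generate and contains_required_disclaimer are hypothetical placeholders, not part of the pegasi-ai API.

    # Illustrative statistical check on an LLM behavior (not the pegasi-ai API).
    # `generate(prompt) -> str` is a user-supplied call to the model under test.
    from scipy.stats import binomtest

    def contains_required_disclaimer(text: str) -> bool:
        # Binary property the model is expected to satisfy (placeholder check).
        return "not financial advice" in text.lower()

    def test_disclaimer_rate(generate, prompt, n=50, required_rate=0.95):
        passes = sum(contains_required_disclaimer(generate(prompt)) for _ in range(n))
        # One-sided binomial test: is the observed pass rate significantly
        # below the required rate?
        result = binomtest(passes, n, p=required_rate, alternative="less")
        assert result.pvalue > 0.05, (
            f"Pass rate {passes}/{n} is significantly below {required_rate:.0%}"
        )

Because individual LLM runs are noisy, a test like this treats each behavioral requirement as a hypothesis about a pass rate rather than a single pass/fail call.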

Quick Start & Requirements

  • Install via pip install pegasi-ai.
  • Requires an API key from app.pegasi.ai (a minimal configuration sketch follows this list).
  • See the Evals notebook for a quick start.
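The API key from app.pegasi.ai is typically supplied at runtime; one common pattern is an environment variable, sketched below. PEGASI_API_KEY is an assumed name, not confirmed by the README; the Evals notebook shows the actual configuration.

    # Quick-start configuration sketch (the environment variable name is an assumption).
    import os

    api_key = os.environ.get("PEGASI_API_KEY")  # key issued at app.pegasi.ai
    if api_key is None:
        raise RuntimeError("Set PEGASI_API_KEY before running the evaluation suite.")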

Highlighted Details

  • Provides a comprehensive testing suite for model behavior validation.
  • Supports quantitative and qualitative metrics for performance measurement.
  • Includes simple safety checks and output validation capabilities (an illustrative sketch follows this list).
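As a rough illustration of what a simple safety and output-validation check can look like, the sketch below flags a few obvious violations in a model response. The patterns and helper name are placeholders, not the project's built-in validators or judges.

    import re

    # Illustrative output validator -- placeholder logic, not the pegasi-ai judges.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def validate_output(text: str) -> list[str]:
        """Return a list of violation labels found in a model response."""
        violations = []
        if EMAIL_RE.search(text):
            violations.append("possible email address leaked")
        if SSN_RE.search(text):
            violations.append("possible SSN leaked")
        if not text.strip():
            violations.append("empty response")
        return violations

    # Usage: fail a test case if any violation is detected.
    assert validate_output("Contact me at alice@example.com") == ["possible email address leaked"]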

Maintenance & Community

The project ships with established AI validators and out-of-the-box judges. Future plans include distribution-based testing, expanded statistical validation tools, improved test result visualization, custom test case creation, and community-driven test suites.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The framework is currently under active development, with features like distribution-based testing, advanced statistical validation, and custom test case creation still on the roadmap.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

1k stars
Top 0.2% on SourcePulse
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

3k stars
Top 0.4% on SourcePulse
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago