giskard-oss by Giskard-AI

Open-source testing framework for AI & LLM systems

Created 3 years ago
4,874 stars

Top 10.3% on SourcePulse

View on GitHub
Project Summary

Giskard is an open-source Python framework for evaluating and testing AI systems, including LLM-based applications like RAG agents and traditional ML models. It aims to identify and mitigate risks related to performance, bias, and security vulnerabilities, offering automated scanning and dataset generation for comprehensive quality assurance.

How It Works

Giskard automates the detection of issues such as hallucinations, prompt injection, and discrimination by analyzing model outputs against predefined or generated test cases. For RAG applications, its RAG Evaluation Toolkit (RAGET) can automatically generate question-answer pairs and relevant contexts from a knowledge base, enabling detailed evaluation of RAG components like the generator, retriever, and knowledge base itself.
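A minimal sketch of that RAGET workflow, assuming the giskard.rag API as described in the project documentation; the dataframe contents and the my_rag_agent object wrapped by answer_fn are hypothetical placeholders:

    import pandas as pd
    from giskard.rag import KnowledgeBase, generate_testset, evaluate

    # Build a knowledge base from a dataframe of document chunks.
    df = pd.DataFrame({"text": [
        "Giskard is an open-source testing framework for AI systems.",
        "RAGET generates question-answer pairs from a knowledge base.",
    ]})
    knowledge_base = KnowledgeBase.from_pandas(df, columns=["text"])

    # Generate question / reference-answer / context triples for evaluation.
    testset = generate_testset(
        knowledge_base,
        num_questions=10,
        agent_description="A chatbot answering questions about Giskard",
    )

    # answer_fn wraps the RAG agent under test (my_rag_agent is hypothetical).
    def answer_fn(question, history=None):
        return my_rag_agent.ask(question)

    # Score each component (generator, retriever, knowledge base, ...).
    report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
    report.to_html("raget_report.html")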

Quick Start & Requirements

  • Install via pip: pip install "giskard[llm]" -U
  • Supported Python versions: 3.9, 3.10, 3.11.
  • RAG evaluation requires additional libraries such as langchain, langchain-openai, tiktoken, and pypdf.
  • Example Colab notebook available; a minimal scan workflow is also sketched after this list.
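The typical quick-start flow, per the Giskard docs, is to wrap the system under test in giskard.Model and run the automated scan. In this sketch, llm_call and the agent name are hypothetical stand-ins for your own inference code:

    import pandas as pd
    import giskard

    # model_predict maps a dataframe of inputs to a list of text outputs;
    # llm_call is a hypothetical stand-in for your own inference code.
    def model_predict(df: pd.DataFrame):
        return [llm_call(question) for question in df["question"]]

    giskard_model = giskard.Model(
        model=model_predict,
        model_type="text_generation",
        name="Product FAQ agent",  # hypothetical example agent
        description="Answers questions about the product documentation.",
        feature_names=["question"],
    )

    # Scan for hallucination, prompt injection, harmful content, bias, ...
    scan_results = giskard.scan(giskard_model)
    scan_results.to_html("scan_report.html")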

Highlighted Details

  • Detects a wide range of LLM issues including hallucinations, harmful content, prompt injection, and bias.
  • RAGET automatically generates evaluation datasets and scores RAG components (Generator, Retriever, Rewriter, Router, Knowledge Base).
  • Integrates with any model and environment, with a separate library giskard-vision for computer vision tasks.
  • Provides a giskard.scan() function for automated issue detection and scan_results.generate_test_suite() for creating test suites; see the sketch after this list.
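Continuing the quick-start sketch above, the scan findings can be frozen into a reusable regression suite:

    # Turn the scan findings into a test suite and run it.
    test_suite = scan_results.generate_test_suite("My first test suite")
    test_suite.run()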

Maintenance & Community

  • Active community on Discord.
  • Open to contributions with a contribution guide.
  • Sponsorships available via GitHub, with current sponsors including Lunary and Biolevate.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The RAGET testset generation can be time-consuming depending on the number of questions requested.
  • Supports Python 3.9-3.11 only; newer Python releases may not be immediately compatible.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 14
  • Issues (30d): 1
  • Star History: 77 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

  • Top 0.1% · 3k stars
  • LLM evaluation framework
  • Created 2 years ago · Updated 1 month ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

  • Top 0.4% · 3k stars
  • Evaluation harness for LLMs trained on code
  • Created 4 years ago · Updated 8 months ago