promptbench by microsoft

LLM evaluation framework

Created 2 years ago
2,775 stars

Top 16.8% on SourcePulse

Project Summary

PromptBench is a PyTorch-based Python library for evaluating Large Language Models (LLMs), offering a unified framework for researchers to assess model performance, test prompt engineering techniques, and analyze robustness against adversarial attacks. It supports a wide array of language and multi-modal datasets and models, including both open-source and proprietary options.

How It Works

PromptBench provides a modular architecture that integrates various components for LLM evaluation. It supports standard evaluation protocols, dynamic evaluation methods like DyVal to mitigate data contamination, and efficient multi-prompt evaluation via PromptEval. The framework allows for the implementation and testing of diverse prompt engineering strategies and adversarial attacks, enabling comprehensive analysis of LLM behavior and robustness.
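The workflow described above — running each prompt template over a labeled dataset, querying the model, and scoring the projected answers — can be sketched in plain Python. This is a hypothetical illustration of the general evaluation loop that a framework like PromptBench automates; the function names, the stub model, and the label-projection rule are all invented for this example and are not the library's actual API.

```python
# Toy sketch of a multi-prompt evaluation loop (illustrative only; not
# PromptBench's real API).

def evaluate_prompts(model_fn, prompts, dataset):
    """Return per-prompt accuracy on a labeled dataset.

    model_fn: callable mapping a formatted prompt string to a raw model
              response (here, a stub standing in for a real LLM).
    prompts:  prompt templates containing a {content} placeholder.
    dataset:  iterable of (text, label) pairs.
    """
    results = {}
    for template in prompts:
        correct = 0
        for text, label in dataset:
            response = model_fn(template.format(content=text))
            # Project the free-form response onto a label (toy rule).
            prediction = "positive" if "positive" in response.lower() else "negative"
            correct += prediction == label
        results[template] = correct / len(dataset)
    return results

# Stub model: calls anything mentioning "great" positive.
def stub_model(prompt):
    return "positive" if "great" in prompt else "negative"

data = [("a great movie", "positive"), ("a dull movie", "negative")]
templates = [
    "Classify the sentiment: {content}",
    "Is this review positive? {content}",
]
scores = evaluate_prompts(stub_model, templates, data)
```

Comparing per-prompt scores like this is the basic idea behind multi-prompt evaluation; PromptEval's contribution is making that comparison efficient when the number of prompts is large.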

Quick Start & Requirements

  • Install via pip: pip install promptbench
  • Install from GitHub for latest features: Clone the repository and install dependencies using requirements.txt within a Python 3.9 conda environment.
  • Additional dependencies: TextAttack is required for Prompt Attacks.
  • Documentation: https://github.com/microsoft/promptbench
  • Tutorials: examples/basic.ipynb, examples/multimodal.ipynb, examples/prompt_attack.ipynb, examples/dyval.ipynb, examples/efficient_multi_prompt_eval.ipynb
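The install paths above can be run roughly as follows; the conda environment name is arbitrary, and the clone URL assumes the repository address given under Documentation.

```shell
# Option 1: stable release from PyPI
pip install promptbench

# Option 2: latest features from GitHub, inside a Python 3.9 conda env
conda create -n promptbench python=3.9 -y
conda activate promptbench
git clone https://github.com/microsoft/promptbench.git
cd promptbench
pip install -r requirements.txt

# Optional: TextAttack, needed only for prompt attacks
pip install textattack
```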

Highlighted Details

  • Supports over 20 language datasets (e.g., GLUE, MMLU, GSM8K) and 10 multi-modal datasets (e.g., VQAv2, MMMU, ChartQA).
  • Integrates numerous LLMs, including Llama2, Mistral, Gemini, GPT-3.5, GPT-4, and multi-modal variants.
  • Implements advanced evaluation techniques like DyVal for dynamic sample generation and PromptEval for efficient multi-prompt evaluation.
  • Includes capabilities for prompt engineering methods (e.g., Chain-of-Thought, EmotionPrompt) and adversarial attacks (e.g., TextFooler, BertAttack).
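To make the adversarial-attack idea concrete, here is a toy character-level prompt perturbation in the spirit of typo-based attacks (e.g., DeepWordBug-style character swaps). It is an invented illustration, not PromptBench's TextFooler or BertAttack integration: a robustness check then compares model accuracy on the original versus the perturbed prompt.

```python
import random

def perturb_prompt(prompt, rate=0.2, seed=0):
    """Swap two adjacent characters inside randomly chosen words.

    rate: probability that any given word (longer than 3 chars) is perturbed.
    """
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)

original = "Classify the sentiment of the following review: {content}"
attacked = perturb_prompt(original, rate=0.5)
```

A robust model should give similar answers for `original` and `attacked`; a large accuracy drop under such perturbations is what the framework's attack suite is designed to measure.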

Maintenance & Community

The project is actively maintained by Microsoft, with recent updates including support for GPT-4o, Gemini, Mistral, and multi-modal capabilities. Contributions are welcomed via pull requests, with a Contributor License Agreement (CLA) required. The project follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation may lag behind the latest GitHub commits, so users who need the newest features should install from source. The framework's breadth also comes with a learning curve for users new to LLM evaluation methodologies.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 2 more.

YiVal by YiVal

0%
2k
Prompt engineering assistant for GenAI apps
Created 2 years ago
Updated 1 year ago
Starred by Anastasios Angelopoulos (cofounder of LMArena), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 35 more.

evals by openai

1.2%
18k
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 3 years ago
Updated 3 months ago