promptbench by microsoft

LLM evaluation framework

Created 2 years ago
2,711 stars

Top 17.4% on SourcePulse

View on GitHub
Project Summary

PromptBench is a PyTorch-based Python library for evaluating Large Language Models (LLMs), offering a unified framework for researchers to assess model performance, test prompt engineering techniques, and analyze robustness against adversarial attacks. It supports a wide array of language and multi-modal datasets and models, including both open-source and proprietary options.

How It Works

PromptBench provides a modular architecture that integrates various components for LLM evaluation. It supports standard evaluation protocols, dynamic evaluation methods like DyVal to mitigate data contamination, and efficient multi-prompt evaluation via PromptEval. The framework allows for the implementation and testing of diverse prompt engineering strategies and adversarial attacks, enabling comprehensive analysis of LLM behavior and robustness.
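The modular pattern described above (dataset, prompt template, model call, metric) can be sketched in plain Python. This is an illustrative stand-in, not the PromptBench API: `toy_model` and the two-sample dataset are invented for the example, and a real run would substitute an actual LLM call and a loaded benchmark.

```python
# Sketch of the evaluation-loop pattern that PromptBench modularizes:
# a dataset, a prompt template, a model call, and a metric.

def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: labels text by a crude keyword rule."""
    return "positive" if "great" in prompt.lower() else "negative"

def evaluate(dataset, template, model, label_map):
    """Format each sample into a prompt, query the model, score accuracy."""
    correct = 0
    for sample in dataset:
        prompt = template.format(content=sample["content"])
        pred = model(prompt).strip().lower()
        correct += int(label_map.get(pred, -1) == sample["label"])
    return correct / len(dataset)

dataset = [
    {"content": "A great, heartfelt film.", "label": 1},
    {"content": "Dull and lifeless.", "label": 0},
]
template = "Classify the sentiment as positive or negative: {content}\nAnswer:"
label_map = {"positive": 1, "negative": 0}

accuracy = evaluate(dataset, template, toy_model, label_map)
print(accuracy)  # 1.0
```

Swapping any one component (a perturbed template, a different model, a dynamically generated dataset) while holding the others fixed is what enables the robustness and prompt-engineering comparisons the framework is built for.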

Quick Start & Requirements

  • Install via pip: pip install promptbench
  • Install from GitHub for the latest features: clone the repository and install dependencies from requirements.txt inside a Python 3.9 conda environment.
  • Additional dependencies: TextAttack is required for Prompt Attacks.
  • Documentation: https://github.com/microsoft/promptbench
  • Tutorials: examples/basic.ipynb, examples/multimodal.ipynb, examples/prompt_attack.ipynb, examples/dyval.ipynb, examples/efficient_multi_prompt_eval.ipynb
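The GitHub install path described in the bullets above amounts to the following commands; the environment name `promptbench` is an arbitrary choice for this sketch, not mandated by the project.

```shell
# Create and activate a Python 3.9 conda environment, then install
# the dependencies listed in the repository's requirements.txt.
conda create -n promptbench python=3.9
conda activate promptbench
git clone https://github.com/microsoft/promptbench.git
cd promptbench
pip install -r requirements.txt
```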

Highlighted Details

  • Supports over 20 language datasets (e.g., GLUE, MMLU, GSM8K) and 10 multi-modal datasets (e.g., VQAv2, MMMU, ChartQA).
  • Integrates numerous LLMs, including Llama2, Mistral, Gemini, GPT-3.5, GPT-4, and multi-modal variants.
  • Implements advanced evaluation techniques like DyVal for dynamic sample generation and PromptEval for efficient multi-prompt evaluation.
  • Includes capabilities for prompt engineering methods (e.g., Chain-of-Thought, EmotionPrompt) and adversarial attacks (e.g., TextFooler, BertAttack).
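Two of the ideas in the list above can be illustrated with toy code (this is not the PromptBench implementation): the zero-shot Chain-of-Thought trigger phrase, and a character-level prompt perturbation of the kind character-level adversarial attacks apply. Both functions here are invented for illustration.

```python
import random

def chain_of_thought(question: str) -> str:
    """Append the standard zero-shot Chain-of-Thought trigger phrase."""
    return f"{question}\nLet's think step by step."

def char_swap_attack(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent characters in a fraction of words to perturb a prompt,
    a simplified stand-in for character-level adversarial attacks."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for w in prompt.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        out.append(w)
    return " ".join(out)

print(chain_of_thought("What is 17 * 24?"))
print(char_swap_attack("Classify the sentiment of the review.", rate=0.5))
```

Comparing a model's accuracy on clean versus perturbed prompts, as in the loop sketched earlier, is the basic recipe behind the robustness analyses these attack methods support.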

Maintenance & Community

The project is actively maintained by Microsoft, with recent updates including support for GPT-4o, Gemini, Mistral, and multi-modal capabilities. Contributions are welcomed via pull requests, with a Contributor License Agreement (CLA) required. The project follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation may lag behind the latest GitHub commits. The framework is extensive, and its breadth may present a learning curve for users new to LLM evaluation methodologies.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 2 more.

YiVal by YiVal

Prompt engineering assistant for GenAI apps
Top 0.1% on SourcePulse, 2k stars
Created 2 years ago, updated 1 year ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 34 more.

evals by openai

Framework for evaluating LLMs and LLM systems, plus benchmark registry
Top 0.2% on SourcePulse, 17k stars
Created 2 years ago, updated 9 months ago