promptbench by microsoft

LLM evaluation framework

Created 2 years ago
2,775 stars

Top 16.8% on SourcePulse

Project Summary

PromptBench is a PyTorch-based Python library for evaluating Large Language Models (LLMs), offering a unified framework for researchers to assess model performance, test prompt engineering techniques, and analyze robustness against adversarial attacks. It supports a wide array of language and multi-modal datasets and models, including both open-source and proprietary options.

How It Works

PromptBench provides a modular architecture that integrates various components for LLM evaluation. It supports standard evaluation protocols, dynamic evaluation methods like DyVal to mitigate data contamination, and efficient multi-prompt evaluation via PromptEval. The framework allows for the implementation and testing of diverse prompt engineering strategies and adversarial attacks, enabling comprehensive analysis of LLM behavior and robustness.
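The workflow described above — running each prompt template over a labeled dataset, querying the model, and scoring the projected answers — can be sketched in plain Python. This is a hypothetical illustration of the general evaluation loop that a framework like PromptBench automates; the function names, the stub model, and the label-projection rule are all invented for this example and are not the library's actual API.

```python
# Toy sketch of a multi-prompt evaluation loop (illustrative only; not
# PromptBench's real API).

def evaluate_prompts(model_fn, prompts, dataset):
    """Return per-prompt accuracy on a labeled dataset.

    model_fn: callable mapping a formatted prompt string to a raw model
              response (here, a stub standing in for a real LLM).
    prompts:  prompt templates containing a {content} placeholder.
    dataset:  iterable of (text, label) pairs.
    """
    results = {}
    for template in prompts:
        correct = 0
        for text, label in dataset:
            response = model_fn(template.format(content=text))
            # Project the free-form response onto a label (toy rule).
            prediction = "positive" if "positive" in response.lower() else "negative"
            correct += prediction == label
        results[template] = correct / len(dataset)
    return results

# Stub model: calls anything mentioning "great" positive.
def stub_model(prompt):
    return "positive" if "great" in prompt else "negative"

data = [("a great movie", "positive"), ("a dull movie", "negative")]
templates = [
    "Classify the sentiment: {content}",
    "Is this review positive? {content}",
]
scores = evaluate_prompts(stub_model, templates, data)
```

Comparing per-prompt scores like this is the basic idea behind multi-prompt evaluation; PromptEval's contribution is making that comparison efficient when the number of prompts is large.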

Quick Start & Requirements

  • Install via pip: pip install promptbench
  • Install from GitHub for latest features: Clone the repository and install dependencies using requirements.txt within a Python 3.9 conda environment.
  • Additional dependencies: TextAttack is required for Prompt Attacks.
  • Documentation: https://github.com/microsoft/promptbench
  • Tutorials: examples/basic.ipynb, examples/multimodal.ipynb, examples/prompt_attack.ipynb, examples/dyval.ipynb, examples/efficient_multi_prompt_eval.ipynb
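The install paths above can be run roughly as follows; the conda environment name is arbitrary, and the clone URL assumes the repository address given under Documentation.

```shell
# Option 1: stable release from PyPI
pip install promptbench

# Option 2: latest features from GitHub, inside a Python 3.9 conda env
conda create -n promptbench python=3.9 -y
conda activate promptbench
git clone https://github.com/microsoft/promptbench.git
cd promptbench
pip install -r requirements.txt

# Optional: TextAttack, needed only for prompt attacks
pip install textattack
```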

Highlighted Details

  • Supports over 20 language datasets (e.g., GLUE, MMLU, GSM8K) and 10 multi-modal datasets (e.g., VQAv2, MMMU, ChartQA).
  • Integrates numerous LLMs, including Llama2, Mistral, Gemini, GPT-3.5, GPT-4, and multi-modal variants.
  • Implements advanced evaluation techniques like DyVal for dynamic sample generation and PromptEval for efficient multi-prompt evaluation.
  • Includes capabilities for prompt engineering methods (e.g., Chain-of-Thought, EmotionPrompt) and adversarial attacks (e.g., TextFooler, BertAttack).
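To make the adversarial-attack idea concrete, here is a toy character-level prompt perturbation in the spirit of typo-based attacks (e.g., DeepWordBug-style character swaps). It is an invented illustration, not PromptBench's TextFooler or BertAttack integration: a robustness check then compares model accuracy on the original versus the perturbed prompt.

```python
import random

def perturb_prompt(prompt, rate=0.2, seed=0):
    """Swap two adjacent characters inside randomly chosen words.

    rate: probability that any given word (longer than 3 chars) is perturbed.
    """
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)

original = "Classify the sentiment of the following review: {content}"
attacked = perturb_prompt(original, rate=0.5)
```

A robust model should give similar answers for `original` and `attacked`; a large accuracy drop under such perturbations is what the framework's attack suite is designed to measure.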

Maintenance & Community

The project is actively maintained by Microsoft, with recent updates including support for GPT-4o, Gemini, Mistral, and multi-modal capabilities. Contributions are welcomed via pull requests, with a Contributor License Agreement (CLA) required. The project follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation may lag behind the latest GitHub commits, so users who need the newest features should install from source. The framework's breadth also comes with a learning curve for users new to LLM evaluation methodologies.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 2 more.

YiVal by YiVal

0%
2k
Prompt engineering assistant for GenAI apps
Created 2 years ago
Updated 1 year ago
Starred by Anastasios Angelopoulos (cofounder of LMArena), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 35 more.

evals by openai

1.2%
18k
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 3 years ago
Updated 3 months ago