promptbench  by microsoft

LLM evaluation framework

created 2 years ago
2,673 stars

Top 18.0% on sourcepulse

Project Summary

PromptBench is a PyTorch-based Python library for evaluating Large Language Models (LLMs), offering a unified framework for researchers to assess model performance, test prompt engineering techniques, and analyze robustness against adversarial attacks. It supports a wide array of language and multi-modal datasets and models, including both open-source and proprietary options.

How It Works

PromptBench provides a modular architecture that integrates various components for LLM evaluation. It supports standard evaluation protocols, dynamic evaluation methods like DyVal to mitigate data contamination, and efficient multi-prompt evaluation via PromptEval. The framework allows for the implementation and testing of diverse prompt engineering strategies and adversarial attacks, enabling comprehensive analysis of LLM behavior and robustness.
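The standard protocol described above (format each prompt with a sample, query the model, map the raw output to a label, score) can be sketched in plain Python. This is a minimal illustration, not PromptBench's own API: `toy_model`, `project_label`, and `evaluate_prompts` are hypothetical names, and the "model" is a keyword rule standing in for a real LLM so the example runs without any downloads.

```python
from typing import Callable, Dict, List


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM: a keyword rule, not a real model."""
    lowered = text.lower()
    return "positive" if ("great" in lowered or "love" in lowered) else "negative"


def project_label(raw_output: str) -> int:
    """Map free-form model output to a class id; -1 marks an unparseable answer."""
    mapping = {"positive": 1, "negative": 0}
    return mapping.get(raw_output.strip().lower(), -1)


def evaluate_prompts(
    prompts: List[str],
    dataset: List[dict],
    model: Callable[[str], str],
) -> Dict[str, float]:
    """Return classification accuracy for every prompt template."""
    scores: Dict[str, float] = {}
    for prompt in prompts:
        correct = 0
        for sample in dataset:
            # Fill the template with the sample text, then query the model.
            input_text = prompt.format(content=sample["content"])
            pred = project_label(model(input_text))
            correct += int(pred == sample["label"])
        scores[prompt] = correct / len(dataset)
    return scores


# Tiny hand-written dataset, for illustration only.
DATASET = [
    {"content": "I love this film", "label": 1},
    {"content": "A great, warm story", "label": 1},
    {"content": "Dull and far too long", "label": 0},
    {"content": "I regret watching it", "label": 0},
]
PROMPTS = [
    "Classify the sentence as positive or negative: {content}",
    "Is this a great review or a bad one? {content}",
]

scores = evaluate_prompts(PROMPTS, DATASET, toy_model)
for prompt, accuracy in scores.items():
    print(f"{accuracy:.2f}  {prompt}")
```

With this toy model the first template scores 1.0 and the second 0.5, because the word "great" in the second template leaks into every input. That is the kind of prompt sensitivity a multi-prompt evaluation is meant to surface.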

Quick Start & Requirements

  • Install via pip: pip install promptbench
  • Install from GitHub for latest features: Clone the repository and install dependencies using requirements.txt within a Python 3.9 conda environment.
  • Additional dependencies: TextAttack is required for Prompt Attacks.
  • Documentation: https://github.com/microsoft/promptbench
  • Tutorials: examples/basic.ipynb, examples/multimodal.ipynb, examples/prompt_attack.ipynb, examples/dyval.ipynb, examples/efficient_multi_prompt_eval.ipynb
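After installing, a quick sanity check can confirm the interpreter version and that the package is importable. The sketch below uses only the standard library; `check_environment` is a hypothetical helper, treating Python 3.9 as a minimum is an assumption (the source only says a Python 3.9 conda environment is used), and `textattack` as the import name of TextAttack is likewise assumed.

```python
import importlib.util
import sys
from typing import Dict, Tuple


def check_environment(
    package: str = "promptbench",
    min_python: Tuple[int, int] = (3, 9),  # assumption: 3.9 treated as a floor
) -> Dict[str, bool]:
    """Report whether the interpreter is recent enough and the package is importable."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        # find_spec returns None when a top-level module cannot be located.
        "package_installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print("promptbench:", check_environment("promptbench"))
    # Only needed for prompt attacks; import name assumed to be "textattack".
    print("textattack:", check_environment("textattack"))
```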

Highlighted Details

  • Supports over 20 language datasets (e.g., GLUE, MMLU, GSM8K) and 10 multi-modal datasets (e.g., VQAv2, MMMU, ChartQA).
  • Integrates numerous LLMs, including Llama2, Mistral, Gemini, GPT-3.5, GPT-4, and multi-modal variants.
  • Implements advanced evaluation techniques like DyVal for dynamic sample generation and PromptEval for efficient multi-prompt evaluation.
  • Includes capabilities for prompt engineering methods (e.g., Chain-of-Thought, EmotionPrompt) and adversarial attacks (e.g., TextFooler, BertAttack).
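The idea behind dynamic sample generation, producing fresh test items on demand so they cannot have appeared in a model's training data, can be illustrated with a toy generator. This is not the DyVal algorithm itself; `make_sample` and `generate_dataset` are hypothetical names, and random arithmetic trees are used here only as a simple stand-in for generated reasoning problems.

```python
import random
from typing import Dict, List, Tuple, Union


def make_sample(rng: random.Random, depth: int) -> Tuple[str, int]:
    """Return (expression_text, value) for a random arithmetic tree of the given depth."""
    if depth == 0:
        value = rng.randint(1, 9)
        return str(value), value
    left_text, left_val = make_sample(rng, depth - 1)
    right_text, right_val = make_sample(rng, depth - 1)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        value = left_val + right_val
    elif op == "-":
        value = left_val - right_val
    else:
        value = left_val * right_val
    return f"({left_text} {op} {right_text})", value


def generate_dataset(n: int, depth: int, seed: int) -> List[Dict[str, Union[str, int]]]:
    """Generate n question/answer pairs; a new seed yields a new, unseen test set."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        question, answer = make_sample(rng, depth)
        dataset.append({"question": question, "answer": answer})
    return dataset


if __name__ == "__main__":
    for sample in generate_dataset(n=3, depth=2, seed=0):
        print(sample["question"], "=", sample["answer"])
```

Raising `depth` makes the problems harder, and changing `seed` regenerates the set, so difficulty and freshness are both controlled by the evaluator rather than fixed in a static benchmark file.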

Maintenance & Community

The project is actively maintained by Microsoft, with recent updates including support for GPT-4o, Gemini, Mistral, and multi-modal capabilities. Contributions are welcomed via pull requests, with a Contributor License Agreement (CLA) required. The project follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The pip release may lag behind the latest GitHub commits. The framework is extensive, so users new to LLM evaluation methodologies should expect a learning curve.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 84 stars in the last 90 days
