LLM evaluation framework
PromptBench is a PyTorch-based Python library for evaluating Large Language Models (LLMs), offering a unified framework for researchers to assess model performance, test prompt engineering techniques, and analyze robustness against adversarial attacks. It supports a wide array of language and multi-modal datasets and models, including both open-source and proprietary options.
How It Works
PromptBench provides a modular architecture that integrates various components for LLM evaluation. It supports standard evaluation protocols, dynamic evaluation methods like DyVal to mitigate data contamination, and efficient multi-prompt evaluation via PromptEval. The framework allows for the implementation and testing of diverse prompt engineering strategies and adversarial attacks, enabling comprehensive analysis of LLM behavior and robustness.
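The sketch below shows how these components typically fit together in a classification benchmark. The class and method names (pb.DatasetLoader, pb.LLMModel, pb.Prompt, pb.InputProcess, pb.OutputProcess, pb.Eval) follow the patterns used in the project's example notebooks, but exact signatures may differ between releases, so treat this as an assumption-based illustration rather than a definitive reference.

```python
import promptbench as pb
from tqdm import tqdm

# Load a supported dataset and model (names taken from the project's examples).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# The Prompt API accepts a list, so several prompts can be evaluated in one run.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "Determine whether the following sentence is positive or negative: {content}",
])

# Hypothetical helper: map the model's textual answer onto the dataset's integer labels.
def project(pred):
    return {"negative": 0, "positive": 1}.get(pred, -1)

for prompt in prompts:
    preds, labels = [], []
    for data in tqdm(dataset):
        # Fill the prompt template with the current example and query the model.
        input_text = pb.InputProcess.basic_format(prompt, data)
        raw_pred = model(input_text)
        # Post-process the raw output into a class label, then project it to an integer.
        preds.append(project(pb.OutputProcess.cls(raw_pred, model.model)))
        labels.append(data["label"])

    # Score this prompt on the dataset.
    accuracy = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{accuracy:.3f}  {prompt}")
```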
Quick Start & Requirements
Install with pip install promptbench, or install the dependencies listed in requirements.txt within a Python 3.9 conda environment. Example notebooks cover the main workflows: examples/basic.ipynb, examples/multimodal.ipynb, examples/prompt_attack.ipynb, examples/dyval.ipynb, and examples/efficient_multi_prompt_eval.ipynb.
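For the adversarial workflow covered by examples/prompt_attack.ipynb, a rough sketch is shown below. It assumes an Attack class that takes a model, an attack name, a dataset, a prompt, an evaluation function, and a list of words the attack must not modify; the actual constructor arguments and supported attack names may differ, so consult the notebook for the authoritative version.

```python
import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10)

# Accuracy of a candidate prompt on the dataset; the attack tries to drive this down.
def eval_func(prompt, dataset, model):
    preds, labels = [], []
    for data in dataset:
        input_text = pb.InputProcess.basic_format(prompt, data)
        raw_pred = model(input_text)
        # Hypothetical projection from the textual answer to the integer label space.
        pred = {"negative": 0, "positive": 1}.get(pb.OutputProcess.cls(raw_pred, model.model), -1)
        preds.append(pred)
        labels.append(data["label"])
    return pb.Eval.compute_cls_accuracy(preds, labels)

prompt = "Classify the sentence as positive or negative: {content}"
# Keep the label words and the {content} placeholder intact during the attack.
unmodifiable_words = ["positive", "negative", "content"]

# "stresstest" is one attack name used in the examples; the argument order here is assumed.
attack = pb.Attack(model, "stresstest", dataset, prompt, eval_func, unmodifiable_words, verbose=True)
print(attack.attack())
```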
Highlighted Details
Key capabilities include dynamic evaluation with DyVal to mitigate data contamination, efficient multi-prompt evaluation via PromptEval, a suite of adversarial prompt attacks, and support for both open-source and proprietary models, including multi-modal ones.
Maintenance & Community
The project is actively maintained by Microsoft, with recent updates including support for GPT-4o, Gemini, Mistral, and multi-modal capabilities. Contributions are welcomed via pull requests, with a Contributor License Agreement (CLA) required. The project follows the Microsoft Open Source Code of Conduct.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The pip release may lag behind the latest GitHub commits. The framework's breadth also means a learning curve for users new to LLM evaluation methodologies.