LLM evaluation platform for assessing model capabilities across diverse datasets
OpenCompass is a comprehensive LLM evaluation platform designed for researchers and developers to assess the performance of various large language models across a wide array of datasets. It offers a standardized, reproducible, and extensible framework for evaluating capabilities such as knowledge, reasoning, coding, and instruction following, aiming to provide a fair and open benchmark.
How It Works
OpenCompass employs a modular design that supports a diverse range of models (including HuggingFace and API-based) and over 100 datasets. It facilitates efficient distributed evaluation, allowing for rapid assessment of large models. The platform supports multiple evaluation paradigms like zero-shot, few-shot, and chain-of-thought prompting with customizable prompt templates, enabling users to elicit maximum model performance.
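To illustrate the config-driven workflow, a minimal evaluation config might look like the sketch below. This is only an illustration: the dataset import path, model identifier, and field values are placeholders, and exact config names vary across OpenCompass versions.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Reuse a packaged dataset config; the exact import path depends on the installed version.
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets

datasets = gsm8k_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',              # label shown in result tables
        path='huggyllama/llama-7b',      # HuggingFace model ID (placeholder)
        tokenizer_path='huggyllama/llama-7b',
        max_seq_len=2048,
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1),        # per-model resource request
    )
]
A config like this is handed to the runner, which takes care of prompt construction, (optionally distributed) inference, and scoring.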
Quick Start & Requirements
Install with pip install -U opencompass (full installation: pip install "opencompass[full]") or from source. Inference backends such as lmdeploy or vLLM require separate installation. Datasets can be prepared via manual download or loaded automatically with ModelScope (pip install modelscope[framework]).
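Once installed, an evaluation run is typically launched from the command line. The commands below are illustrative; the model and dataset config names are placeholders that must match configs available in your OpenCompass installation.
# Evaluate a packaged model config against a packaged dataset config (names are placeholders)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# From a source checkout, a Python config file can be passed to the runner instead
python run.py configs/eval_demo.py -w outputs/demo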
Highlighted Details
Maintenance & Community
The project is actively maintained with frequent updates and new feature additions. Community engagement is encouraged via Discord and WeChat. Links to the website, documentation, and issue reporting are provided.
Licensing & Compatibility
The project is licensed under the Apache-2.0 license, which permits commercial use and linking with closed-source software.
Limitations & Caveats
Version 0.4.0 introduced breaking changes that consolidated configuration files into the package. While many models are supported, some third-party features such as HumanEval require extra installation steps. The roadmap lists ongoing work on features such as a long-context leaderboard and a coding evaluation leaderboard.