LLM evaluation platform for assessing model capabilities across diverse datasets
OpenCompass is a comprehensive LLM evaluation platform designed for researchers and developers to assess the performance of various large language models across a wide array of datasets. It offers a standardized, reproducible, and extensible framework for evaluating capabilities such as knowledge, reasoning, coding, and instruction following, aiming to provide a fair and open benchmark.
How It Works
OpenCompass employs a modular design that supports a diverse range of models (including HuggingFace and API-based) and over 100 datasets. It facilitates efficient distributed evaluation, allowing for rapid assessment of large models. The platform supports multiple evaluation paradigms like zero-shot, few-shot, and chain-of-thought prompting with customizable prompt templates, enabling users to elicit maximum model performance.
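To illustrate the config-driven workflow, a minimal evaluation config might look like the sketch below. This is only an illustration: the dataset import path, model identifier, and field values are placeholders, and exact config names vary across OpenCompass versions.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Reuse a packaged dataset config; the exact import path depends on the installed version.
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets

datasets = gsm8k_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',              # label shown in result tables
        path='huggyllama/llama-7b',      # HuggingFace model ID (placeholder)
        tokenizer_path='huggyllama/llama-7b',
        max_seq_len=2048,
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1),        # per-model resource request
    )
]
A config like this is handed to the runner, which takes care of prompt construction, (optionally distributed) inference, and scoring.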
Quick Start & Requirements
Install with pip install -U opencompass (full installation: pip install "opencompass[full]") or from source. Inference backends such as lmdeploy or vLLM require separate installation. Datasets can be prepared via manual download or loaded automatically with ModelScope (pip install modelscope[framework]).
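Once installed, an evaluation run is typically launched from the command line. The commands below are illustrative; the model and dataset config names are placeholders that must match configs available in your OpenCompass installation.
# Evaluate a packaged model config against a packaged dataset config (names are placeholders)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# From a source checkout, a Python config file can be passed to the runner instead
python run.py configs/eval_demo.py -w outputs/demo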
Highlighted Details
Maintenance & Community
The project is actively maintained with frequent updates and new feature additions. Community engagement is encouraged via Discord and WeChat. Links to the website, documentation, and issue reporting are provided.
Licensing & Compatibility
The project is licensed under the Apache-2.0 license, which permits commercial use and linking with closed-source software.
Limitations & Caveats
Version 0.4.0 introduced breaking changes that consolidated configuration files into the package. While many models are supported, some third-party features such as HumanEval require extra installation steps. The roadmap lists ongoing work on features such as a long-context leaderboard and a coding evaluation leaderboard.