SuperCLUE by CLUEbenchmark

Benchmark for Chinese foundation models

created 2 years ago
3,232 stars

Top 15.3% on sourcepulse

View on GitHub
Project Summary

SuperCLUE is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) specifically for the Chinese language. It targets researchers and developers working with Chinese LLMs, providing a standardized framework to assess performance across various dimensions, including language understanding, generation, specialized skills, AI agent capabilities, and safety.

How It Works

SuperCLUE evaluates LLMs across 12 core capabilities, categorized into four quadrants: Language Understanding & Generation, Professional Skills & Knowledge, AI Agent, and Safety. The benchmark utilizes a multi-dimensional evaluation approach, including both objective tests and subjective assessments judged by advanced models like GPT-4 Turbo. This methodology aims to provide a holistic and nuanced understanding of model performance in real-world Chinese language scenarios.
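As a rough illustration of the objective-plus-judge approach described above, the sketch below averages exact-match accuracy for items with a gold reference and a judge model's 1-5 rating for open-ended ones. The Item schema, the prompt wording, and the judge_fn callable are assumptions for illustration only, not SuperCLUE's actual harness.

```python
# Minimal sketch of a mixed objective/subjective evaluation loop.
# NOT SuperCLUE's actual code; field names, prompt, and score scale are assumed.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Item:
    question: str                    # test prompt (e.g. a Chinese instruction)
    model_answer: str                # answer produced by the model under test
    reference: Optional[str] = None  # gold answer for objective items, if any

def objective_score(item: Item) -> float:
    """Exact-match accuracy for items that have a gold reference."""
    return float(item.model_answer.strip() == item.reference.strip())

def subjective_score(item: Item, judge_fn: Callable[[str], str]) -> float:
    """Ask a judge model (e.g. GPT-4 Turbo) to rate the answer from 1 to 5."""
    prompt = (
        "Rate the following answer for helpfulness, accuracy and fluency "
        "on a 1-5 scale. Reply with the number only.\n"
        f"Question: {item.question}\nAnswer: {item.model_answer}"
    )
    reply = judge_fn(prompt)
    digits = [c for c in reply if c.isdigit()]
    return float(digits[0]) if digits else 0.0

def evaluate(items: list[Item], judge_fn: Callable[[str], str]) -> dict[str, float]:
    """Average objective accuracy and subjective rating over a test set."""
    obj = [objective_score(i) for i in items if i.reference is not None]
    subj = [subjective_score(i, judge_fn) for i in items if i.reference is None]
    return {
        "objective_accuracy": sum(obj) / len(obj) if obj else 0.0,
        "subjective_mean": sum(subj) / len(subj) if subj else 0.0,
    }
```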

Quick Start & Requirements

The project provides detailed leaderboards and technical reports, but no direct installation or execution commands are present in the README. Access to the benchmark likely involves interacting with the models or datasets described in the reports.

Highlighted Details

  • Evaluates 12 fundamental capabilities across four key quadrants for Chinese LLMs.
  • Includes a dedicated benchmark for AI Agent capabilities, focusing on tool use and task planning (see the sketch after this list).
  • Regularly updated leaderboards feature prominent Chinese LLMs and global models.
  • Benchmark methodology has been refined, with increased test set size and upgraded evaluation models (e.g., GPT-4 Turbo).
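To make the AI Agent bullet above concrete, here is a minimal sketch of what a tool-use test item and its scoring check might look like. The schema, field names, and example task are hypothetical and are not taken from SuperCLUE's published format.

```python
# Hypothetical shape of an agent/tool-use test item; everything below is an
# illustrative assumption, not SuperCLUE's actual schema.
agent_case = {
    "task": "Check tomorrow's weather in Beijing and summarize it in one sentence",
    "available_tools": ["weather_api", "calculator", "web_search"],
    "expected_tool": "weather_api",  # tool a correct plan should call first
}

def tool_choice_correct(model_plan: list[str], case: dict) -> bool:
    """Score whether the first tool in the model's plan matches the expected tool."""
    return bool(model_plan) and model_plan[0] == case["expected_tool"]
```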

Maintenance & Community

The project is actively maintained, with regular updates to leaderboards and benchmark reports. The README encourages contact and collaboration from interested individuals and institutions.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

The README focuses on the benchmark's scope and methodology, with no explicit mention of limitations, known bugs, or alpha status. The evaluation relies on GPT-4 Turbo as a judge, which may introduce biases inherent to the judge model.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 80 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago
updated 1 day ago
2k stars

Top 0.9% on sourcepulse