SuperCLUE by CLUEbenchmark

Benchmark for Chinese foundation models

Created 2 years ago
3,256 stars

Top 14.8% on SourcePulse

View on GitHub
Project Summary

SuperCLUE is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) specifically for the Chinese language. It targets researchers and developers working with Chinese LLMs, providing a standardized framework to assess performance across various dimensions, including language understanding, generation, specialized skills, AI agent capabilities, and safety.

How It Works

SuperCLUE evaluates LLMs across 12 core capabilities, categorized into four quadrants: Language Understanding & Generation, Professional Skills & Knowledge, AI Agent, and Safety. The benchmark utilizes a multi-dimensional evaluation approach, including both objective tests and subjective assessments judged by advanced models like GPT-4 Turbo. This methodology aims to provide a holistic and nuanced understanding of model performance in real-world Chinese language scenarios.
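
To make the subjective side of this pipeline concrete, the snippet below sketches a minimal LLM-as-judge scoring step, assuming an OpenAI-compatible Python client. The judge prompt, rubric, and function names are illustrative assumptions, not SuperCLUE's actual evaluation code.

```python
# Minimal LLM-as-judge sketch (illustrative; not SuperCLUE's actual pipeline).
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict evaluator for Chinese-language responses.\n"
    "Rate the answer from 1 to 10 for accuracy, relevance, and fluency.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer only."
)

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4-turbo") -> int:
    """Ask a judge model to score a candidate answer on a 1-10 scale."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
    )
    return int(response.choices[0].message.content.strip())

# Example: score one candidate response to a Chinese prompt.
score = judge_answer(
    "请用一句话解释什么是大语言模型？",
    "大语言模型是基于海量文本训练的神经网络，能够理解和生成自然语言。",
)
print(score)
```

Objective test items, by contrast, can be scored by exact or rule-based matching against reference answers, so no judge model is needed for that portion of the evaluation.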

Quick Start & Requirements

The project provides detailed leaderboards and technical reports, but the README contains no installation or execution commands. Using the benchmark likely involves the models and datasets described in those reports.

Highlighted Details

  • Evaluates 12 fundamental capabilities across four key quadrants for Chinese LLMs (see the aggregation sketch after this list).
  • Includes a dedicated benchmark for AI Agent capabilities, focusing on tool use and task planning.
  • Regularly updated leaderboards feature prominent Chinese LLMs and global models.
  • Benchmark methodology has been refined, with increased test set size and upgraded evaluation models (e.g., GPT-4 Turbo).
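
As a rough illustration of how per-capability results within the four quadrants could roll up into a leaderboard score, here is a small aggregation sketch. The capability names, scores, and unweighted averaging are placeholder assumptions, not SuperCLUE's published capability list or weighting.

```python
# Sketch of aggregating per-capability scores into the four quadrant scores
# named above. Capability names and values are hypothetical placeholders.
from statistics import mean

capability_scores = {
    "Language Understanding & Generation": {"semantic_understanding": 86.0, "generation": 82.5},
    "Professional Skills & Knowledge": {"logic_reasoning": 74.0, "code": 70.5, "math": 68.0},
    "AI Agent": {"tool_use": 65.0, "task_planning": 61.5},
    "Safety": {"safety": 90.0},
}

# Unweighted mean per quadrant, then an overall mean across quadrants.
quadrant_scores = {q: mean(scores.values()) for q, scores in capability_scores.items()}
overall = mean(quadrant_scores.values())

for quadrant, score in quadrant_scores.items():
    print(f"{quadrant}: {score:.1f}")
print(f"Overall (unweighted mean of quadrants): {overall:.1f}")
```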

Maintenance & Community

The project is actively maintained, with regular updates to leaderboards and benchmark reports. The README encourages contact and collaboration from interested individuals and institutions.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

The README focuses on the benchmark's scope and methodology, with no explicit mention of limitations, known bugs, or alpha status. The evaluation relies on GPT-4 Turbo as a judge, which may introduce biases inherent to the judge model.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

AGIEval by ruixiangcui

Top 0.1% on SourcePulse
763 stars
Benchmark for evaluating foundation models on human-centric tasks
Created 2 years ago
Updated 1 year ago