SuperCLUE by CLUEbenchmark

Benchmark for Chinese foundation models

Created 2 years ago
3,256 stars

Top 14.8% on SourcePulse

View on GitHub
Project Summary

SuperCLUE is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) specifically for the Chinese language. It targets researchers and developers working with Chinese LLMs, providing a standardized framework to assess performance across various dimensions, including language understanding, generation, specialized skills, AI agent capabilities, and safety.

How It Works

SuperCLUE evaluates LLMs across 12 core capabilities, categorized into four quadrants: Language Understanding & Generation, Professional Skills & Knowledge, AI Agent, and Safety. The benchmark utilizes a multi-dimensional evaluation approach, including both objective tests and subjective assessments judged by advanced models like GPT-4 Turbo. This methodology aims to provide a holistic and nuanced understanding of model performance in real-world Chinese language scenarios.
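
To make the subjective side of this pipeline concrete, the snippet below sketches a minimal LLM-as-judge scoring step, assuming an OpenAI-compatible Python client. The judge prompt, rubric, and function names are illustrative assumptions, not SuperCLUE's actual evaluation code.

```python
# Minimal LLM-as-judge sketch (illustrative; not SuperCLUE's actual pipeline).
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict evaluator for Chinese-language responses.\n"
    "Rate the answer from 1 to 10 for accuracy, relevance, and fluency.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer only."
)

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4-turbo") -> int:
    """Ask a judge model to score a candidate answer on a 1-10 scale."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
    )
    return int(response.choices[0].message.content.strip())

# Example: score one candidate response to a Chinese prompt.
score = judge_answer(
    "请用一句话解释什么是大语言模型？",
    "大语言模型是基于海量文本训练的神经网络，能够理解和生成自然语言。",
)
print(score)
```

Objective test items, by contrast, can be scored by exact or rule-based matching against reference answers, so no judge model is needed for that portion of the evaluation.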

Quick Start & Requirements

The project provides detailed leaderboards and technical reports, but the README contains no installation or execution commands. Using the benchmark likely involves the models and datasets described in those reports.

Highlighted Details

  • Evaluates 12 fundamental capabilities across four key quadrants for Chinese LLMs (see the aggregation sketch after this list).
  • Includes a dedicated benchmark for AI Agent capabilities, focusing on tool use and task planning.
  • Regularly updated leaderboards feature prominent Chinese LLMs and global models.
  • Benchmark methodology has been refined, with increased test set size and upgraded evaluation models (e.g., GPT-4 Turbo).
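
As a rough illustration of how per-capability results within the four quadrants could roll up into a leaderboard score, here is a small aggregation sketch. The capability names, scores, and unweighted averaging are placeholder assumptions, not SuperCLUE's published capability list or weighting.

```python
# Sketch of aggregating per-capability scores into the four quadrant scores
# named above. Capability names and values are hypothetical placeholders.
from statistics import mean

capability_scores = {
    "Language Understanding & Generation": {"semantic_understanding": 86.0, "generation": 82.5},
    "Professional Skills & Knowledge": {"logic_reasoning": 74.0, "code": 70.5, "math": 68.0},
    "AI Agent": {"tool_use": 65.0, "task_planning": 61.5},
    "Safety": {"safety": 90.0},
}

# Unweighted mean per quadrant, then an overall mean across quadrants.
quadrant_scores = {q: mean(scores.values()) for q, scores in capability_scores.items()}
overall = mean(quadrant_scores.values())

for quadrant, score in quadrant_scores.items():
    print(f"{quadrant}: {score:.1f}")
print(f"Overall (unweighted mean of quadrants): {overall:.1f}")
```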

Maintenance & Community

The project is actively maintained, with regular updates to leaderboards and benchmark reports. The README encourages contact and collaboration from interested individuals and institutions.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

The README focuses on the benchmark's scope and methodology, with no explicit mention of limitations, known bugs, or alpha status. The evaluation relies on GPT-4 Turbo as a judge, which may introduce biases inherent to the judge model.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

AGIEval by ruixiangcui

Top 0.1% on SourcePulse
763 stars
Benchmark for evaluating foundation models on human-centric tasks
Created 2 years ago
Updated 1 year ago