CL-bench by Tencent-Hunyuan

Benchmark for language model context learning

Created 1 month ago
422 stars

Top 69.9% on SourcePulse

View on GitHub
Project Summary

CL-bench addresses the critical gap in evaluating Large Language Models' (LLMs) ability to learn and apply novel, context-specific knowledge—essential for real-world deployment beyond static pre-training. This benchmark is designed for researchers and engineers seeking to build more intelligent, adaptable LLMs, offering a rigorous method to assess context learning.

How It Works

Each benchmark instance comprises a system prompt, a task, and a context containing new knowledge absent from pre-training. Models must learn from this context to solve tasks spanning domain reasoning, rule systems, procedural execution, and empirical discovery. Instances are crafted by domain experts to ensure realism and quality, and each is accompanied by detailed rubrics for multi-dimensional verification. This contamination-free design ensures the evaluation measures genuine context learning.
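The instance structure described above can be sketched in a few lines of Python. Note that the field names (system_prompt, context, task, rubrics) are assumptions for illustration, not the dataset's documented schema; the contents of the sample record are invented.

```python
import json

# Hypothetical CL-bench instance, mirroring the structure described above:
# a system prompt, a context carrying novel knowledge, a task, and
# expert-annotated rubrics. Field names and contents are assumptions,
# not the dataset's actual schema.
sample_line = json.dumps({
    "system_prompt": "You are a careful reasoner.",
    "context": "In the fictional Zorn calculus, addition binds tighter than multiplication.",
    "task": "Evaluate 2 + 3 * 4 under Zorn precedence.",
    "rubrics": [
        "Applies Zorn precedence (addition first)",
        "Arrives at 20",
    ],
})

def load_instance(jsonl_line: str) -> dict:
    """Parse one JSONL record into an instance dict."""
    instance = json.loads(jsonl_line)
    # Every instance is self-contained: the context alone must suffice.
    for key in ("system_prompt", "context", "task", "rubrics"):
        assert key in instance, f"missing field: {key}"
    return instance

inst = load_instance(sample_line)
print(len(inst["rubrics"]))  # → 2
```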

Quick Start & Requirements

Installation: pip install openai tqdm. Prerequisites: an OpenAI API key (or a compatible endpoint) and the CL-bench.jsonl dataset from Hugging Face.

Inference: python infer.py --model <model_name> --input CL-bench.jsonl --output <output_path>
Evaluation: python eval.py --input <output_path> --judge-model <judge_model_name>

Key resources: Leaderboard (www.clbench.com), Paper (arxiv.org/abs/2602.03587), Data (huggingface.co/datasets/tencent/CL-bench), Blog (hy.tencent.com/research/100025?langVersion=en).
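A minimal sketch of the inference step infer.py presumably performs for each instance: fold the context and task into a chat request for an OpenAI-compatible endpoint. The field names, prompt layout, and helper name are assumptions for illustration; consult infer.py for the actual format.

```python
# build_messages is a hypothetical helper, assuming instances carry
# system_prompt / context / task fields; infer.py may differ.
def build_messages(instance: dict) -> list[dict]:
    """Assemble chat messages from one CL-bench-style instance."""
    return [
        {"role": "system", "content": instance["system_prompt"]},
        # The context precedes the task so the model can learn from it first.
        {"role": "user", "content": f"{instance['context']}\n\n{instance['task']}"},
    ]

# Usage with the openai client (requires an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="<model_name>", messages=build_messages(instance))

messages = build_messages({
    "system_prompt": "You are a careful reasoner.",
    "context": "New rule: widgets cost 3 credits each.",
    "task": "How much do 4 widgets cost?",
})
print(messages[0]["role"])  # → system
```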

Highlighted Details

  • Realistic & High-quality: Expert-crafted contexts, tasks, and rubrics ensure high fidelity.
  • Contamination-free: Contexts introduce novel knowledge, preventing reliance on pre-training data.
  • Challenging: Current top models achieve only 23.7% accuracy, highlighting significant difficulty.
  • Rigorously Verifiable: An average of 63.2 expert-annotated rubrics per context enables thorough evaluation.
  • Self-contained: All necessary information is provided within the context, eliminating the need for external retrieval.
  • Dataset: Features 1,899 tasks in JSONL format.
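Given the roughly 63 rubrics per context noted above, a natural way to score a response is the fraction of rubrics a judge marks as satisfied. This aggregation is a hedged sketch, an assumption about how eval.py might combine verdicts, not its documented behavior.

```python
# Hypothetical rubric aggregation: a judge model marks each
# expert-annotated rubric pass/fail, and the instance score is the
# fraction passed. This is an illustrative assumption, not eval.py's
# confirmed scoring rule.
def score_instance(rubric_verdicts: list[bool]) -> float:
    """Fraction of rubrics the response satisfies, in [0, 1]."""
    if not rubric_verdicts:
        return 0.0
    return sum(rubric_verdicts) / len(rubric_verdicts)

print(score_instance([True, True, False, True]))  # → 0.75
```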

Maintenance & Community

Direct contact is available via email for Shihan Dou (shihandou@foxmail.com) and Ming Zhang (mingzhang23@m.fudan.edu.cn). No community channels (e.g., Slack, Discord) or public roadmaps are detailed in the README.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for any adoption, particularly concerning commercial use or integration with proprietary systems.

Limitations & Caveats

Current state-of-the-art LLMs exhibit very low performance on CL-bench, with the best model scoring 23.7%, indicating robust context learning remains a significant challenge. The absence of a stated license is a critical caveat for adoption.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 430 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

LMaaS-Papers by txsun1997 — 544 stars. Curated list of LMaaS research papers. Created 3 years ago; updated 1 year ago.