CL-bench by Tencent-Hunyuan

Benchmark for language model context learning

Created 1 month ago
422 stars

Top 69.9% on SourcePulse

View on GitHub
Project Summary

CL-bench addresses the critical gap in evaluating Large Language Models' (LLMs) ability to learn and apply novel, context-specific knowledge—essential for real-world deployment beyond static pre-training. This benchmark is designed for researchers and engineers seeking to build more intelligent, adaptable LLMs, offering a rigorous method to assess context learning.

How It Works

Each benchmark instance comprises a system prompt, a task, and a context containing new knowledge absent from pre-training. Models must learn from this context to solve tasks spanning domain reasoning, rule systems, procedural execution, and empirical discovery. Instances are crafted by domain experts to ensure realism and quality, and each is accompanied by detailed rubrics for multi-dimensional verification. This contamination-free design ensures the evaluation measures genuine context learning.
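The instance structure described above can be sketched in a few lines of Python. Note that the field names (system_prompt, context, task, rubrics) are assumptions for illustration, not the dataset's documented schema; the contents of the sample record are invented.

```python
import json

# Hypothetical CL-bench instance, mirroring the structure described above:
# a system prompt, a context carrying novel knowledge, a task, and
# expert-annotated rubrics. Field names and contents are assumptions,
# not the dataset's actual schema.
sample_line = json.dumps({
    "system_prompt": "You are a careful reasoner.",
    "context": "In the fictional Zorn calculus, addition binds tighter than multiplication.",
    "task": "Evaluate 2 + 3 * 4 under Zorn precedence.",
    "rubrics": [
        "Applies Zorn precedence (addition first)",
        "Arrives at 20",
    ],
})

def load_instance(jsonl_line: str) -> dict:
    """Parse one JSONL record into an instance dict."""
    instance = json.loads(jsonl_line)
    # Every instance is self-contained: the context alone must suffice.
    for key in ("system_prompt", "context", "task", "rubrics"):
        assert key in instance, f"missing field: {key}"
    return instance

inst = load_instance(sample_line)
print(len(inst["rubrics"]))  # → 2
```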

Quick Start & Requirements

Installation: pip install openai tqdm. Prerequisites: an OpenAI API key (or a compatible endpoint) and the CL-bench.jsonl dataset from Hugging Face.

Inference: python infer.py --model <model_name> --input CL-bench.jsonl --output <output_path>
Evaluation: python eval.py --input <output_path> --judge-model <judge_model_name>

Key resources: Leaderboard (www.clbench.com), Paper (arxiv.org/abs/2602.03587), Data (huggingface.co/datasets/tencent/CL-bench), Blog (hy.tencent.com/research/100025?langVersion=en).
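A minimal sketch of the inference step infer.py presumably performs for each instance: fold the context and task into a chat request for an OpenAI-compatible endpoint. The field names, prompt layout, and helper name are assumptions for illustration; consult infer.py for the actual format.

```python
# build_messages is a hypothetical helper, assuming instances carry
# system_prompt / context / task fields; infer.py may differ.
def build_messages(instance: dict) -> list[dict]:
    """Assemble chat messages from one CL-bench-style instance."""
    return [
        {"role": "system", "content": instance["system_prompt"]},
        # The context precedes the task so the model can learn from it first.
        {"role": "user", "content": f"{instance['context']}\n\n{instance['task']}"},
    ]

# Usage with the openai client (requires an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="<model_name>", messages=build_messages(instance))

messages = build_messages({
    "system_prompt": "You are a careful reasoner.",
    "context": "New rule: widgets cost 3 credits each.",
    "task": "How much do 4 widgets cost?",
})
print(messages[0]["role"])  # → system
```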

Highlighted Details

  • Realistic & High-quality: Expert-crafted contexts, tasks, and rubrics ensure high fidelity.
  • Contamination-free: Contexts introduce novel knowledge, preventing reliance on pre-training data.
  • Challenging: Current top models achieve only 23.7% accuracy, highlighting significant difficulty.
  • Rigorously Verifiable: An average of 63.2 expert-annotated rubrics per context enables thorough evaluation.
  • Self-contained: All necessary information is provided within the context, eliminating the need for external retrieval.
  • Dataset: Features 1,899 tasks in JSONL format.
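Given the roughly 63 rubrics per context noted above, a natural way to score a response is the fraction of rubrics a judge marks as satisfied. This aggregation is a hedged sketch, an assumption about how eval.py might combine verdicts, not its documented behavior.

```python
# Hypothetical rubric aggregation: a judge model marks each
# expert-annotated rubric pass/fail, and the instance score is the
# fraction passed. This is an illustrative assumption, not eval.py's
# confirmed scoring rule.
def score_instance(rubric_verdicts: list[bool]) -> float:
    """Fraction of rubrics the response satisfies, in [0, 1]."""
    if not rubric_verdicts:
        return 0.0
    return sum(rubric_verdicts) / len(rubric_verdicts)

print(score_instance([True, True, False, True]))  # → 0.75
```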

Maintenance & Community

Direct contact is available via email for Shihan Dou (shihandou@foxmail.com) and Ming Zhang (mingzhang23@m.fudan.edu.cn). No community channels (e.g., Slack, Discord) or public roadmaps are detailed in the README.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for any adoption, particularly concerning commercial use or integration with proprietary systems.

Limitations & Caveats

Current state-of-the-art LLMs exhibit very low performance on CL-bench, with the best model scoring 23.7%, indicating robust context learning remains a significant challenge. The absence of a stated license is a critical caveat for adoption.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 430 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

LMaaS-Papers by txsun1997 — 544 stars. Curated list of LMaaS research papers. Created 3 years ago; updated 1 year ago.