CL-bench (Tencent-Hunyuan): a benchmark for language model context learning
CL-bench addresses a critical gap in evaluating Large Language Models (LLMs): their ability to learn and apply novel, context-specific knowledge, which is essential for real-world deployment beyond static pre-training. The benchmark is aimed at researchers and engineers building more adaptable LLMs, offering a rigorous method to assess context learning.
How It Works
The benchmark comprises instances with a system prompt, task, and context containing new, pre-training-absent knowledge. Models must learn from this context to solve tasks across domain reasoning, rule systems, procedural execution, and empirical discovery. Instances are meticulously crafted by domain experts, ensuring realism and quality, and are accompanied by detailed rubrics for multi-dimensional verification. This "contamination-free" approach ensures evaluation focuses on genuine context learning.
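To make the instance structure concrete, here is a minimal sketch of how such an instance might look and how it could be presented to a model. The field names and the example content are illustrative assumptions, not the official CL-bench schema; consult the dataset on Hugging Face for the real format.

```python
import json

# Hypothetical CL-bench-style instance (field names are illustrative):
# the context carries knowledge absent from pre-training, and the task
# can only be solved by learning from that context.
instance = {
    "system_prompt": "You are a careful reasoner.",
    "context": (
        "In the fictional Zorvian legal code, contracts signed on a "
        "Restday are void unless countersigned within 3 days."
    ),
    "task": (
        "A contract was signed on a Restday and countersigned 5 days "
        "later. Is it valid under Zorvian law?"
    ),
    "rubric": ["Identifies the Restday rule", "Applies the 3-day window"],
}

def build_messages(inst):
    """Assemble a chat-style prompt: context first, then the task."""
    user = f"Context:\n{inst['context']}\n\nTask:\n{inst['task']}"
    return [
        {"role": "system", "content": inst["system_prompt"]},
        {"role": "user", "content": user},
    ]

messages = build_messages(instance)
print(json.dumps(messages, indent=2))
```

Because the rule about Restday contracts is invented, a model cannot answer from pre-training alone; it must extract and apply the rule from the supplied context, which is exactly what the rubric then verifies.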
Quick Start & Requirements
Installation: pip install openai tqdm
Prerequisites: an OpenAI API key (or a compatible endpoint) and the CL-bench.jsonl dataset from Hugging Face.
Inference: python infer.py --model <model_name> --input CL-bench.jsonl --output <output_path>
Evaluation: python eval.py --input <output_path> --judge-model <judge_model_name>
Key resources: Leaderboard (www.clbench.com), Paper (arxiv.org/abs/2602.03587), Data (huggingface.co/datasets/tencent/CL-bench), Blog (hy.tencent.com/research/100025?langVersion=en)
Highlighted Details
Maintenance & Community
Direct contact is available via email for Shihan Dou (shihandou@foxmail.com) and Ming Zhang (mingzhang23@m.fudan.edu.cn). No community channels (e.g., Slack, Discord) or public roadmaps are detailed in the README.
Licensing & Compatibility
The README does not specify a software license. This omission requires clarification for any adoption, particularly concerning commercial use or integration with proprietary systems.
Limitations & Caveats
Current state-of-the-art LLMs perform poorly on CL-bench: the best model scores 23.7%, indicating that robust context learning remains a significant open challenge. The absence of a stated license is a critical caveat for adoption.