Benchmark for long-context LLM evaluation
LongBench v2 is a benchmark designed to evaluate the deep understanding and reasoning capabilities of Large Language Models (LLMs) on realistic, long-context tasks. It targets researchers and developers building advanced LLMs, offering a challenging and comprehensive dataset to push the boundaries of long-context AI.
How It Works
LongBench v2 consists of 503 multiple-choice questions with contexts ranging from 8k to 2M words, spanning six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. The benchmark emphasizes difficulty and realism: human experts achieved only 53.7% accuracy under a 15-minute time constraint, underscoring the need for enhanced reasoning and scaled inference-time compute.
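As a rough illustration of the data layout, the sketch below loads the Hugging Face release and tallies questions per task category; the column names (domain, difficulty) are assumptions based on the public dataset card, not something this summary guarantees.

```python
# Minimal sketch: inspect the LongBench v2 dataset from Hugging Face.
# Column names ('domain', 'difficulty') are assumptions, not guaranteed here.
from collections import Counter
from datasets import load_dataset

ds = load_dataset('THUDM/LongBench-v2', split='train')
print(len(ds))                    # expected: 503 multiple-choice questions
print(Counter(ds['domain']))      # counts per task category
print(Counter(ds['difficulty']))  # easy / hard split
```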
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Load the dataset from Hugging Face via load_dataset('THUDM/LongBench-v2', split='train'), generate model predictions with pred.py, and export the results with result.py.
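The sketch below outlines the prediction-to-score path that pred.py and result.py cover: build a multiple-choice prompt, query a model, extract the chosen letter, and compute accuracy. The prompt template, answer-extraction rule, and field names (context, question, choice_A..choice_D, answer) are simplified placeholders, not the repository's actual implementation.

```python
# Sketch of an evaluation loop; prompt format and answer extraction are
# simplified placeholders, not the repository's pred.py/result.py logic.
import re
from datasets import load_dataset

def build_prompt(ex):
    # Multiple-choice prompt: long context, then the question and options.
    return (
        f"{ex['context']}\n\n"
        f"Question: {ex['question']}\n"
        f"A. {ex['choice_A']}\nB. {ex['choice_B']}\n"
        f"C. {ex['choice_C']}\nD. {ex['choice_D']}\n"
        "Answer with a single letter (A/B/C/D)."
    )

def extract_choice(text):
    # Take the first standalone A/B/C/D in the model output.
    m = re.search(r'\b([ABCD])\b', text)
    return m.group(1) if m else None

def evaluate(generate):
    # `generate` is any callable mapping a prompt string to a model response.
    ds = load_dataset('THUDM/LongBench-v2', split='train')
    correct = sum(
        extract_choice(generate(build_prompt(ex))) == ex['answer'] for ex in ds
    )
    return correct / len(ds)
```

Any deployed model can back the generate callable, for example a vLLM server as sketched under Limitations & Caveats below.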
Highlighted Details
Maintenance & Community
The project is associated with THUDM and has accumulated multiple academic citations, indicating active research and development. The GitHub repository appears to be the primary channel for questions and contributions.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for any commercial or closed-source integration.
Limitations & Caveats
The evaluation process requires deploying models using vLLM, which may introduce specific hardware and configuration requirements. The benchmark is designed for advanced LLMs, and performance on smaller or less capable models may be limited.
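For the vLLM deployment step, a common pattern is to start vLLM's OpenAI-compatible server and query it from Python, as in the hedged sketch below; the model name, port, and flags are illustrative assumptions, not values prescribed by the repository.

```python
# Sketch: query a model served by vLLM's OpenAI-compatible API, e.g. started with
#   vllm serve <model-name> --max-model-len 131072
# The model name, port, and flags are illustrative, not prescribed by the repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt):
    resp = client.chat.completions.create(
        model="<model-name>",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=128,
    )
    return resp.choices[0].message.content
```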