Benchmark for long-context LLM evaluation
LongBench v2 is a benchmark designed to evaluate the deep understanding and reasoning capabilities of Large Language Models (LLMs) on realistic, long-context tasks. It targets researchers and developers building advanced LLMs, offering a challenging and comprehensive dataset to push the boundaries of long-context AI.
How It Works
LongBench v2 consists of 503 multiple-choice questions with contexts ranging from 8k to 2M words, spanning six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. The benchmark emphasizes difficulty and realism: human experts achieved only 53.7% accuracy under a 15-minute time constraint, underscoring the need for enhanced reasoning and scaled inference-time compute.
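As a rough illustration of the data layout, the sketch below loads the Hugging Face release and tallies questions per task category; the column names (domain, difficulty) are assumptions based on the public dataset card, not something this summary guarantees.

```python
# Minimal sketch: inspect the LongBench v2 dataset from Hugging Face.
# Column names ('domain', 'difficulty') are assumptions, not guaranteed here.
from collections import Counter
from datasets import load_dataset

ds = load_dataset('THUDM/LongBench-v2', split='train')
print(len(ds))                    # expected: 503 multiple-choice questions
print(Counter(ds['domain']))      # counts per task category
print(Counter(ds['difficulty']))  # easy / hard split
```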
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Load the dataset from Hugging Face via load_dataset('THUDM/LongBench-v2', split='train'), generate model predictions with pred.py, and export the results with result.py.
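The sketch below outlines the prediction-to-score path that pred.py and result.py cover: build a multiple-choice prompt, query a model, extract the chosen letter, and compute accuracy. The prompt template, answer-extraction rule, and field names (context, question, choice_A..choice_D, answer) are simplified placeholders, not the repository's actual implementation.

```python
# Sketch of an evaluation loop; prompt format and answer extraction are
# simplified placeholders, not the repository's pred.py/result.py logic.
import re
from datasets import load_dataset

def build_prompt(ex):
    # Multiple-choice prompt: long context, then the question and options.
    return (
        f"{ex['context']}\n\n"
        f"Question: {ex['question']}\n"
        f"A. {ex['choice_A']}\nB. {ex['choice_B']}\n"
        f"C. {ex['choice_C']}\nD. {ex['choice_D']}\n"
        "Answer with a single letter (A/B/C/D)."
    )

def extract_choice(text):
    # Take the first standalone A/B/C/D in the model output.
    m = re.search(r'\b([ABCD])\b', text)
    return m.group(1) if m else None

def evaluate(generate):
    # `generate` is any callable mapping a prompt string to a model response.
    ds = load_dataset('THUDM/LongBench-v2', split='train')
    correct = sum(
        extract_choice(generate(build_prompt(ex))) == ex['answer'] for ex in ds
    )
    return correct / len(ds)
```

Any deployed model can back the generate callable, for example a vLLM server as sketched under Limitations & Caveats below.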
Highlighted Details
Maintenance & Community
The project is associated with THUDM and has accumulated multiple academic citations, indicating active research and development. The GitHub repository appears to be the primary channel for questions and contributions.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for any commercial or closed-source integration.
Limitations & Caveats
The evaluation process requires deploying models using vLLM, which may introduce specific hardware and configuration requirements. The benchmark is designed for advanced LLMs, and performance on smaller or less capable models may be limited.
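For the vLLM deployment step, a common pattern is to start vLLM's OpenAI-compatible server and query it from Python, as in the hedged sketch below; the model name, port, and flags are illustrative assumptions, not values prescribed by the repository.

```python
# Sketch: query a model served by vLLM's OpenAI-compatible API, e.g. started with
#   vllm serve <model-name> --max-model-len 131072
# The model name, port, and flags are illustrative, not prescribed by the repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt):
    resp = client.chat.completions.create(
        model="<model-name>",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=128,
    )
    return resp.choices[0].message.content
```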