LongBench by THUDM

Benchmark for long-context LLM evaluation

created 2 years ago
938 stars

Top 39.9% on sourcepulse

View on GitHub
Project Summary

LongBench v2 is a benchmark designed to evaluate the deep understanding and reasoning capabilities of Large Language Models (LLMs) on realistic, long-context tasks. It targets researchers and developers building advanced LLMs, offering a challenging and comprehensive dataset to push the boundaries of long-context AI.

How It Works

LongBench v2 utilizes a dataset of 503 multiple-choice questions with contexts ranging from 8k to 2M words, focusing on six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repo understanding, and long structured data understanding. The benchmark emphasizes difficulty and realism, with human experts achieving only 53.7% accuracy under a 15-minute time constraint, highlighting the need for enhanced reasoning and scaled inference-time compute.
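
Because every item is a four-option multiple-choice question, scoring an evaluation run reduces to extracting a letter from each model response and comparing it against the gold answer. The helpers below are a hypothetical illustration of that idea, not the repository's own code; the repo's result.py implements the official parsing and scoring.

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Heuristically pull the final A/B/C/D letter from a model response."""
    matches = re.findall(r"\b([ABCD])\b", model_output)
    return matches[-1] if matches else None

def accuracy(predictions: list[str | None], answers: list[str]) -> float:
    """Fraction of items where the extracted letter matches the gold answer."""
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```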

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Data Loading: Use Hugging Face datasets: load_dataset('THUDM/LongBench-v2', split='train') (see the sketch after this list)
  • Prerequisites: Python and vLLM for model deployment; further requirements depend on the LLM being evaluated.
  • Evaluation: Deploy model with vLLM, run inference via pred.py, and export results with result.py.
  • Links: Project Page, LongBench v2 Paper, LongBench v2 Dataset
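
A minimal sketch of the data-loading step above. The load_dataset path comes from the repo's instructions; the per-item field names used when building the prompt (context, question, choice_A through choice_D) are assumptions and should be checked against dataset.column_names.

```python
from datasets import load_dataset

# Dataset path taken from the repository's instructions.
dataset = load_dataset("THUDM/LongBench-v2", split="train")

print(len(dataset))          # expected: 503 multiple-choice items
print(dataset.column_names)  # verify the real schema before relying on field names

# Field names below are assumed for illustration only.
item = dataset[0]
prompt = (
    f"{item['context']}\n\n"
    f"Question: {item['question']}\n"
    f"A. {item['choice_A']}\nB. {item['choice_B']}\n"
    f"C. {item['choice_C']}\nD. {item['choice_D']}\n"
)
```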

Highlighted Details

  • Context lengths up to 2 million words.
  • Designed to be challenging: human experts reach only 53.7% accuracy under a 15-minute time constraint per question.
  • Multiple-choice format ensures reliable evaluation.
  • Supports Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG) evaluation modes.
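
To illustrate what separates the direct and CoT evaluation modes, the sketch below appends two hypothetical answer instructions to the same multiple-choice prompt. The wording is illustrative only; the actual prompt templates used by pred.py live in the repository.

```python
def add_answer_instruction(prompt: str, cot: bool = False) -> str:
    """Append an answer instruction (hypothetical wording, not the repo's template)."""
    if cot:
        # CoT mode: ask the model to reason before committing to a letter.
        return prompt + "\n\nThink step by step, then give the final answer as a single letter (A, B, C, or D)."
    # Direct mode: ask for the letter immediately.
    return prompt + "\n\nAnswer directly with a single letter (A, B, C, or D)."
```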

Maintenance & Community

The project is associated with THUDM and has multiple academic citations, indicating active research and development. The primary interaction point appears to be the GitHub repository.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for any commercial or closed-source integration.

Limitations & Caveats

The evaluation process requires deploying models with vLLM, which brings its own hardware and configuration requirements. The benchmark is designed for advanced long-context LLMs; smaller or less capable models are likely to perform poorly.
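
For orientation, a typical vLLM-based setup serves the model behind an OpenAI-compatible endpoint and queries it per item. The snippet below is a minimal client-side sketch, not the repository's pred.py; the model name, port, and prompt are placeholders.

```python
# Assumes a vLLM OpenAI-compatible server is already running, e.g. started with:
#   vllm serve <your-model-name> --port 8000
# Model name, port, and prompt are placeholders, not values from the repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "<long context>\n\nQuestion: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer with a single letter."

response = client.chat.completions.create(
    model="<your-model-name>",  # must match the model served by vLLM
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```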

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 78 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

  • yarn by jquesnelle: Context window extension method for LLMs (research paper, models). Top 1.0% · 2k stars · created 2 years ago · updated 1 year ago