Benchmark for long-context language model evaluation
L-Eval is a comprehensive benchmark suite designed to evaluate the capabilities of Long Context Language Models (LCLMs). It addresses the need for standardized evaluation across diverse tasks and document lengths (3k-200k tokens), targeting researchers and developers working with LCLMs. The benchmark offers 20 sub-tasks, over 500 documents, and 2,000 human-labeled query-response pairs, aiming to provide a more nuanced understanding of model performance beyond traditional n-gram metrics.
How It Works
L-Eval employs a multi-faceted evaluation approach. For closed-ended tasks (e.g., multiple-choice, math problems), it uses exact match metrics. For open-ended tasks (e.g., summarization, question answering), it moves beyond standard ROUGE and F1 scores, primarily utilizing Length-Instruction-Enhanced (LIE) evaluation and LLM judges (GPT-4, Turbo-16k) to better capture nuanced generation quality and reduce length bias. This hybrid approach aims for more reliable and informative assessments of LCLMs.
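A minimal sketch of the two scoring paths, using assumed helper names (exact_match and add_length_instruction are illustrative, not the benchmark's API); the actual scoring logic lives in the repository's Evaluation/ scripts.

import re

def exact_match(prediction: str, reference: str) -> bool:
    # Closed-ended tasks (multiple choice, math): normalize whitespace/case
    # and compare the extracted answer directly.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(prediction) == norm(reference)

def add_length_instruction(question: str, reference: str) -> str:
    # Open-ended tasks: LIE-style prompting tells the model roughly how long
    # the ground-truth answer is, which reduces the length bias of LLM judges.
    target_words = len(reference.split())
    return f"{question}\nAnswer this question in about {target_words} words."

For open-ended tasks, an LLM judge then compares the length-controlled output against the human-written reference.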
Quick Start & Requirements
Load sub-tasks directly from the Hugging Face Hub (load_dataset('L4NLP/LEval', testset, split='test')) or clone the repository. Evaluation scripts are provided in the Evaluation/ directory.
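A minimal loading sketch; the sub-task names below are examples only, so consult the repository or the L4NLP/LEval dataset card for the full list of 20.

from datasets import load_dataset

# Example sub-task names only; the benchmark ships 20 sub-tasks in total.
for testset in ["coursera", "sci_fi", "codeU"]:
    data = load_dataset('L4NLP/LEval', testset, split='test')
    # Each record pairs a long source document with human-labeled
    # query-response annotations; inspect data[0] to see the exact fields.
    print(testset, len(data))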
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Certain sub-tasks (codeU, sci_fi) may require disabling Hugging Face caching for download.
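A hedged workaround sketch: download_mode='force_redownload' is a standard argument of the datasets library and is shown here as one way to bypass a stale cache, not as the project's documented fix.

from datasets import load_dataset

# Bypass any stale local cache for the affected sub-tasks by forcing
# a fresh download from the Hub.
data = load_dataset('L4NLP/LEval', 'sci_fi', split='test',
                    download_mode='force_redownload')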