LEval by OpenLMLab

Benchmark for long-context language model evaluation

created 2 years ago · 388 stars · Top 75.0% on sourcepulse

View on GitHub
Project Summary

L-Eval is a comprehensive benchmark suite designed to evaluate the capabilities of Long Context Language Models (LCLMs). It addresses the need for standardized evaluation across diverse tasks and document lengths (3k-200k tokens), targeting researchers and developers working with LCLMs. The benchmark offers 20 sub-tasks, over 500 documents, and 2,000 human-labeled query-response pairs, aiming to provide a more nuanced understanding of model performance beyond traditional n-gram metrics.

How It Works

L-Eval employs a multi-faceted evaluation approach. For closed-ended tasks (e.g., multiple choice, math problems), it uses exact-match metrics. For open-ended tasks (e.g., summarization, question answering), it moves beyond standard ROUGE and F1 scores, primarily relying on Length-Instruction-Enhanced (LIE) evaluation and LLM judges (GPT-4, GPT-3.5-Turbo-16k) to better capture nuanced generation quality and reduce length bias. This hybrid approach aims for more reliable and informative assessments of LCLMs.
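
A minimal sketch of the two evaluation styles described above; the function names and the length-instruction wording here are illustrative assumptions, not L-Eval's actual API:

    # Illustrative sketch: helper names and prompt wording are assumptions,
    # not code from the L-Eval repository.

    def exact_match_score(predictions, references):
        # Closed-ended tasks: fraction of outputs that exactly match the gold answer.
        hits = sum(p.strip().lower() == r.strip().lower()
                   for p, r in zip(predictions, references))
        return hits / len(references)

    def with_length_instruction(question, target_words):
        # Open-ended tasks: append the expected answer length to the query so an
        # LLM judge compares answers of comparable length, reducing length bias.
        return f"{question}\nPlease answer in about {target_words} words."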

Quick Start & Requirements

  • Install/Run: Load data via Hugging Face Datasets (load_dataset('L4NLP/LEval', task_name, split='test'), where task_name is one of the 20 sub-task names) or clone the repository; evaluation scripts live in the Evaluation/ directory. A loading sketch follows this list.
  • Prerequisites: Python >= 3.8, PyTorch (e.g., 1.13.1+cu117), and a matching CUDA toolkit (e.g., 11.7). Flash Attention v2 is recommended for memory efficiency. Optional: Elasticsearch for BM25 retrieval and an OpenAI API key for Ada embedding retrieval.
  • Resources: Baseline model testing typically requires an 80G A800 GPU. Memory-efficient inference with LightLLM is supported for 24G GPUs.
  • Links: HuggingFace Datasets, Leaderboard, Paper
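
A minimal loading sketch, assuming the Hugging Face datasets package is installed; "tpo" stands in for any of the 20 sub-task names (check the dataset card for the full list):

    # pip install datasets
    from datasets import load_dataset

    # "tpo" is one example sub-task; substitute any L-Eval sub-task name.
    data = load_dataset("L4NLP/LEval", "tpo", split="test")
    print(len(data), data.column_names)  # inspect the document and query fields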

Highlighted Details

  • ACL 2024 Outstanding Paper Award.
  • Supports evaluation with LLM judges (GPT-4, GPT-3.5-Turbo-16k) and human annotation via a Flask web app.
  • Includes scripts for retrieval-based baselines built on LangChain.
  • Offers memory-efficient inference options with Flash Attention and LightLLM (a hedged sketch follows this list).
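
A hedged sketch of the Flash Attention path, using the standard attn_implementation flag from Hugging Face Transformers; the model name is a placeholder, not the repository's exact launch configuration:

    import torch
    from transformers import AutoModelForCausalLM

    # Placeholder model; any causal LM with Flash Attention 2 support works.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires flash-attn v2
        device_map="auto",
    )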

Maintenance & Community

  • Supported by OpenCompass.
  • Primary contributors from Fudan University and The University of Hong Kong.
  • Contact: cxan20@fudan.edu.cn

Licensing & Compatibility

  • The repository itself does not explicitly state a license. However, it relies on and cites numerous other open-source datasets, each with its own license. Users should verify compatibility for commercial use based on the underlying dataset licenses.

Limitations & Caveats

  • Some datasets (e.g., codeU, sci_fi) may require disabling Hugging Face caching or forcing a re-download to fetch the latest version (a workaround sketch follows this list).
  • LightLLM server processes might not terminate cleanly.
  • The README notes potential CUDA out-of-memory (OOM) issues and provides mitigation strategies, indicating that running all tasks may require significant GPU resources.
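
One hedged workaround for the caching caveat above, using the standard download_mode argument of datasets.load_dataset to bypass a stale local copy:

    from datasets import load_dataset

    # Force a fresh download so an outdated cached copy is not reused.
    data = load_dataset(
        "L4NLP/LEval", "sci_fi", split="test",
        download_mode="force_redownload",
    )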

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
