Benchmark for long-context language model evaluation
L-Eval is a comprehensive benchmark suite designed to evaluate the capabilities of Long Context Language Models (LCLMs). It addresses the need for standardized evaluation across diverse tasks and document lengths (3k-200k tokens), targeting researchers and developers working with LCLMs. The benchmark offers 20 sub-tasks, over 500 documents, and 2,000 human-labeled query-response pairs, aiming to provide a more nuanced understanding of model performance beyond traditional n-gram metrics.
How It Works
L-Eval employs a multi-faceted evaluation approach. For closed-ended tasks (e.g., multiple-choice, math problems), it uses exact match metrics. For open-ended tasks (e.g., summarization, question answering), it moves beyond standard ROUGE and F1 scores, primarily utilizing Length-Instruction-Enhanced (LIE) evaluation and LLM judges (GPT-4, Turbo-16k) to better capture nuanced generation quality and reduce length bias. This hybrid approach aims for more reliable and informative assessments of LCLMs.
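A minimal sketch of the two scoring paths, using assumed helper names (exact_match and add_length_instruction are illustrative, not the benchmark's API); the actual scoring logic lives in the repository's Evaluation/ scripts.

import re

def exact_match(prediction: str, reference: str) -> bool:
    # Closed-ended tasks (multiple choice, math): normalize whitespace/case
    # and compare the extracted answer directly.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(prediction) == norm(reference)

def add_length_instruction(question: str, reference: str) -> str:
    # Open-ended tasks: LIE-style prompting tells the model roughly how long
    # the ground-truth answer is, which reduces the length bias of LLM judges.
    target_words = len(reference.split())
    return f"{question}\nAnswer this question in about {target_words} words."

For open-ended tasks, an LLM judge then compares the length-controlled output against the human-written reference.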
Quick Start & Requirements
Load sub-tasks directly from the Hugging Face Hub (load_dataset('L4NLP/LEval', testset, split='test')) or clone the repository. Evaluation scripts are provided in the Evaluation/ directory.
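A minimal loading sketch; the sub-task names below are examples only, so consult the repository or the L4NLP/LEval dataset card for the full list of 20.

from datasets import load_dataset

# Example sub-task names only; the benchmark ships 20 sub-tasks in total.
for testset in ["coursera", "sci_fi", "codeU"]:
    data = load_dataset('L4NLP/LEval', testset, split='test')
    # Each record pairs a long source document with human-labeled
    # query-response annotations; inspect data[0] to see the exact fields.
    print(testset, len(data))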
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Certain sub-tasks (codeU, sci_fi) may require disabling Hugging Face caching for download.
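A hedged workaround sketch: download_mode='force_redownload' is a standard argument of the datasets library and is shown here as one way to bypass a stale cache, not as the project's documented fix.

from datasets import load_dataset

# Bypass any stale local cache for the affected sub-tasks by forcing
# a fresh download from the Hub.
data = load_dataset('L4NLP/LEval', 'sci_fi', split='test',
                    download_mode='force_redownload')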