Framework for few-shot language model evaluation
This framework provides a unified system for evaluating generative language models across a wide array of academic benchmarks. It supports numerous model loading methods, including Hugging Face transformers, vLLM, and various API-based models, making it a versatile tool for researchers and developers assessing LLM performance.
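For instance, switching between backends is typically just a matter of the --model flag; the command below is an illustrative sketch (model name, task, and flags may vary by version):

```bash
# Evaluate a Hugging Face model on HellaSwag, zero-shot (model name is illustrative)
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```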
How It Works
The harness employs a flexible, tokenization-agnostic interface to evaluate models on over 60 standard benchmarks comprising hundreds of subtasks. It supports quantized inference via GPTQ (through the AutoGPTQ library), the vLLM backend for faster and more memory-efficient inference, and multi-GPU parallelism via Hugging Face's Accelerate library. Prompt engineering is handled through Jinja2 templating and integration with Promptsource, allowing evaluation setups to be customized.
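As a sketch of the multi-GPU options, evaluation can be run data-parallel through Accelerate or with vLLM's tensor parallelism; model names and exact flags below are illustrative and may differ across releases:

```bash
# Data-parallel evaluation across available GPUs via Accelerate (sketch)
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b \
    --tasks lambada_openai \
    --batch_size 16

# vLLM backend with tensor parallelism across 2 GPUs (sketch)
lm_eval --model vllm \
    --model_args pretrained=EleutherAI/pythia-2.8b,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks gsm8k \
    --batch_size auto
```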
Quick Start & Requirements
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness && cd lm-evaluation-harness && pip install -e .
pip install lm_eval[vllm]   # optional: extra dependencies for the vLLM backend
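After installation, a quick sanity check can list the available tasks and run a small evaluation on CPU; the model and task names below are illustrative:

```bash
# List all available tasks (output format may vary by version)
lm_eval --tasks list

# Smoke test: evaluate a tiny model on 10 examples, CPU only
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --device cpu \
    --batch_size 1 \
    --limit 10
```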
Highlighted Details
Maintenance & Community
The project is actively maintained by EleutherAI, with contributions from numerous researchers and organizations. Support and discussion are available via GitHub issues and the EleutherAI Discord server.
Licensing & Compatibility
The project is licensed under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Native multi-node evaluation is not supported for the Hugging Face hf model type; custom integrations or external inference servers are recommended instead. The MPS backend for Apple Metal GPUs is in early development and may have correctness issues.
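One possible workaround for multi-node setups is to serve the model behind an OpenAI-compatible inference server and point the harness at it; the sketch below assumes such a server is already running at the given URL, and the argument names follow the project's local-completions interface (details may vary by version):

```bash
# Sketch: evaluate against a local OpenAI-compatible completions endpoint
lm_eval --model local-completions \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1/completions,num_concurrent=1,max_retries=3 \
    --tasks gsm8k
```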