lm-evaluation-harness by EleutherAI

Framework for few-shot language model evaluation

created 5 years ago
9,706 stars

Top 5.3% on sourcepulse

Project Summary

This framework provides a unified system for evaluating generative language models across a wide array of academic benchmarks. It supports numerous model loading methods, including Hugging Face transformers, vLLM, and various API-based models, making it a versatile tool for researchers and developers assessing LLM performance.

How It Works

The harness employs a flexible, tokenization-agnostic interface to evaluate models on over 60 standard benchmarks with hundreds of subtasks. It supports advanced inference setups, including quantized models (e.g., GPTQ via the AutoGPTQ library), vLLM for speed and memory efficiency, and multi-GPU parallelism via Hugging Face's Accelerate library. Prompt engineering is facilitated through Jinja2 templating and integration with Promptsource, allowing for customizable evaluation setups.
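The "tokenization-agnostic interface" can be sketched as follows: the harness hands a model (context, continuation) string pairs and asks only for log-likelihoods, so the evaluation logic never needs to know how a given model tokenizes text. This is an illustrative Python sketch, not the project's actual API; the class and function names, and the toy scoring rule, are invented for the example.

```python
# Illustrative sketch of a tokenization-agnostic evaluation interface for
# multiple-choice tasks. The harness-side code only sees strings and
# log-likelihoods; tokenization is hidden behind the model.
from abc import ABC, abstractmethod


class LM(ABC):
    """Minimal model interface (hypothetical, for illustration only)."""

    @abstractmethod
    def loglikelihood(self, requests: list[tuple[str, str]]) -> list[float]:
        """Return log P(continuation | context) for each (context, continuation) pair."""


class ToyLM(LM):
    """Stand-in model that simply prefers shorter continuations (demo only)."""

    def loglikelihood(self, requests):
        return [-float(len(continuation)) for _context, continuation in requests]


def evaluate_multiple_choice(model: LM, question: str, choices: list[str], gold: int) -> bool:
    # Score every answer choice against the same context and pick the argmax,
    # mirroring how loglikelihood-based multiple-choice evaluation works.
    scores = model.loglikelihood([(question, " " + c) for c in choices])
    predicted = max(range(len(choices)), key=scores.__getitem__)
    return predicted == gold


correct = evaluate_multiple_choice(
    ToyLM(),
    "Q: What color is the sky?\nA:",
    ["blue", "a very long implausible answer"],
    gold=0,
)
print(correct)
```

Because the interface deals only in strings and scores, the same evaluation code can drive a Hugging Face model, a vLLM engine, or a remote API, each with its own tokenizer.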

Quick Start & Requirements

  • Install from source: git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness && cd lm-evaluation-harness && pip install -e .
  • Optional dependencies for extended functionality are available (e.g., pip install lm_eval[vllm]).
  • Requires a recent Python installation; GPU acceleration is strongly recommended for reasonable evaluation speed.
  • Documentation: https://github.com/EleutherAI/lm-evaluation-harness#documentation

Highlighted Details

  • Backend for Hugging Face's Open LLM Leaderboard.
  • Supports evaluation on PEFT adapters (e.g., LoRA).
  • Integrates with Weights & Biases and Zeno for results visualization.
  • Includes experimental support for multimodal tasks and steering vectors.

Maintenance & Community

The project is actively maintained by EleutherAI, with contributions from numerous researchers and organizations. Support and discussion are available via GitHub issues and the EleutherAI Discord server.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Native multi-node evaluation is not supported for the Hugging Face hf model type; custom integrations or external servers are recommended. The MPS backend for Metal GPUs is in early development and may have correctness issues.

Health Check

  • Last commit: 17 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 66
  • Issues (30d): 48
  • Star History: 946 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM — lightweight training framework for model pre-training (402 stars, top 1.0%; created 1 year ago, updated 1 week ago).