evalchemy by mlfoundations

LLM evaluation toolkit for post-trained language models

Created 10 months ago
526 stars

Top 60.1% on SourcePulse

View on GitHub
Project Summary

Evalchemy is a unified toolkit for evaluating large language models (LLMs), designed for researchers and developers needing to benchmark model performance across a wide range of tasks. It simplifies the process of setting up and running evaluations, offering parallel processing capabilities and standardized results management.

How It Works

Evalchemy builds upon the LM-Eval-Harness, providing a consistent interface for executing diverse benchmarks. It supports data and model parallelism for faster, scalable evaluations, and offers features like local results tracking, optional database integration for leaderboards, and the ability to swap LLM judges. This approach aims to reduce dependency conflicts and streamline the evaluation workflow.
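
As a rough illustration of that interface, a single-benchmark run might look like the sketch below; the eval.eval entry point, flag names, and model identifier are assumptions modeled on the LM-Eval-Harness CLI rather than quoted from the README.

    # Hypothetical single-benchmark run; entry point and flags are assumed to
    # mirror the LM-Eval-Harness CLI -- check the README for exact usage.
    python -m eval.eval \
        --model hf \
        --tasks MTBench \
        --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
        --batch_size 2 \
        --output_path logs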

Quick Start & Requirements

  • Installation: Recommended via Conda (conda create --name evalchemy python=3.10, conda activate evalchemy). Clone the repo, then pip install -e . and pip install -e eval/chat_benchmarks/alpaca_eval; these steps are combined into a single sketch after this list.
  • Prerequisites: Hugging Face login (huggingface-cli login) for datasets/models. CUDA 12.4 is tested; updates may be needed for older versions.
  • Setup Time: Minimal, primarily environment setup.
  • Links: LM Evaluation Harness, Evalchemy Blog Post, Distributed README.
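
Taken together, the setup steps above amount to roughly the following sequence; the clone URL is inferred from the mlfoundations/evalchemy name rather than quoted from the README, and a working CUDA/driver setup is assumed.

    # Environment setup and editable installs, mirroring the bullets above.
    conda create --name evalchemy python=3.10
    conda activate evalchemy
    git clone https://github.com/mlfoundations/evalchemy.git   # URL inferred from the org/repo name
    cd evalchemy
    pip install -e .
    pip install -e eval/chat_benchmarks/alpaca_eval
    huggingface-cli login   # needed for gated datasets and models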

Highlighted Details

  • Supports over 30 benchmarks, including reasoning (AIME25, MATH500), coding (HumanEvalPlus, BigCodeBench), and instruction following (MTBench, AlpacaEval).
  • Integrates with vLLM and Curator for broad model support, including API-based models.
  • Offers distributed evaluation across multiple nodes for significant speedups; a data-parallel launch sketch follows this list.
  • Provides detailed logging of model configuration, seeds, hardware, and timing.
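
For the data-parallel case, a multi-GPU launch could look roughly like the sketch below; the use of accelerate launch and these particular flags are assumptions, and multi-node runs are covered by the Distributed README linked above, which may prescribe different, cluster-specific commands.

    # Hypothetical data-parallel launch over 8 GPUs; eval.eval flags are assumed
    # to match the single-GPU sketch, and accelerate flag names may vary by
    # version -- defer to the Distributed README for multi-node setups.
    accelerate launch --multi_gpu --num_processes 8 --module eval.eval \
        --model hf \
        --tasks AIME25,MATH500,HumanEvalPlus \
        --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
        --batch_size 2 \
        --output_path logs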

Maintenance & Community

Developed by the DataComp community and Bespoke Labs. Contributions are welcomed via standard GitHub workflows. Citation details are provided.

Licensing & Compatibility

The README does not explicitly state a primary license, and the project builds on LM-Eval-Harness and other dependencies with their own licenses. Commercial use or closed-source linking would therefore require a careful review of all component licenses.

Limitations & Caveats

Some benchmarks, like BigCodeBench, require caution due to potential risks from executing LLM-generated code. ZeroEval benchmarks require requesting access to a private Hugging Face dataset and accepting terms. The README notes that some advanced features might require specific HPC cluster configurations.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA

0.8% · 1k stars
Evaluation suite for long-context language models (research paper)
Created 1 year ago · Updated 1 month ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 18 hours ago