evalchemy by mlfoundations

LLM evaluation toolkit for post-trained language models

created 8 months ago
492 stars

Top 63.6% on sourcepulse

Project Summary

Evalchemy is a unified toolkit for evaluating large language models (LLMs), designed for researchers and developers needing to benchmark model performance across a wide range of tasks. It simplifies the process of setting up and running evaluations, offering parallel processing capabilities and standardized results management.

How It Works

Evalchemy builds upon the LM-Eval-Harness, providing a consistent interface for executing diverse benchmarks. It supports data and model parallelism for faster, scalable evaluations, and offers features like local results tracking, optional database integration for leaderboards, and the ability to swap LLM judges. This approach aims to reduce dependency conflicts and streamline the evaluation workflow.
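
As a concrete sketch of that harness-style interface, an evaluation run might look like the command below. It follows LM-Eval-Harness conventions; the eval.eval module path, flag names, and model identifier are illustrative assumptions rather than confirmed usage, so consult the repository README for the exact invocation.

    # Hypothetical evaluation command in the LM-Eval-Harness style that evalchemy extends.
    # Module path and flags are assumptions for illustration, not confirmed options.
    python -m eval.eval \
        --model hf \
        --model_args "pretrained=meta-llama/Llama-3.1-8B-Instruct" \
        --tasks MTBench,AlpacaEval \
        --batch_size 2 \
        --output_path logs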

Quick Start & Requirements

  • Installation: Recommended via Conda (conda create --name evalchemy python=3.10, then conda activate evalchemy). Clone the repo, then pip install -e . and pip install -e eval/chat_benchmarks/alpaca_eval; the full sequence is consolidated in the script after this list.
  • Prerequisites: Hugging Face login (huggingface-cli login) for datasets/models. CUDA 12.4 is tested; updates may be needed for older versions.
  • Setup Time: Minimal, primarily environment setup.
  • Links: LM Evaluation Harness, Evalchemy Blog Post, Distributed README.
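
Put together, the setup steps above amount to the following shell session. The individual commands come from the bullets; only the clone URL (inferred from the project name) and the step ordering are assumed.

    # Create and activate the recommended Conda environment (Python 3.10).
    conda create --name evalchemy python=3.10
    conda activate evalchemy

    # Clone the repository (URL inferred from the project name) and install it
    # in editable mode, along with the AlpacaEval benchmark package.
    git clone https://github.com/mlfoundations/evalchemy.git
    cd evalchemy
    pip install -e .
    pip install -e eval/chat_benchmarks/alpaca_eval

    # Log in to Hugging Face so gated datasets and models can be downloaded.
    huggingface-cli login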

Highlighted Details

  • Supports over 30 benchmarks, including reasoning (AIME25, MATH500), coding (HumanEvalPlus, BigCodeBench), and instruction following (MTBench, AlpacaEval).
  • Integrates with vLLM and Curator for broad model support, including API-based models (see the vLLM sketch after this list).
  • Offers distributed evaluation across multiple nodes for significant speedups.
  • Provides detailed logging of model configuration, seeds, hardware, and timing.
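
For the vLLM integration noted above, a single-node multi-GPU run might look like the sketch below. The vllm model type and the tensor_parallel_size / gpu_memory_utilization arguments follow LM-Eval-Harness conventions and are assumptions about how evalchemy exposes them; multi-node runs are covered separately in the Distributed README.

    # Hypothetical vLLM-backed evaluation across 4 GPUs on one node.
    # Flag names follow LM-Eval-Harness conventions and are not confirmed evalchemy options.
    python -m eval.eval \
        --model vllm \
        --model_args "pretrained=Qwen/Qwen2.5-7B-Instruct,tensor_parallel_size=4,gpu_memory_utilization=0.8" \
        --tasks AIME25,MATH500 \
        --batch_size auto \
        --output_path logs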

Maintenance & Community

Developed by the DataComp community and Bespoke Labs. Contributions are welcome via standard GitHub workflows, and citation details are provided in the repository.

Licensing & Compatibility

The primary license is not explicitly stated in the README. Because the toolkit builds on LM-Eval-Harness and bundles other benchmark dependencies, commercial use or closed-source linking would require reviewing the license of each component.

Limitations & Caveats

Some benchmarks, such as BigCodeBench, execute LLM-generated code and therefore carry the usual risks of running untrusted code. ZeroEval benchmarks require requesting access to a gated Hugging Face dataset and accepting its terms. The README also notes that some advanced features may need specific HPC cluster configurations.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 120 stars in the last 90 days
