evalchemy by mlfoundations

LLM evaluation toolkit for post-trained language models

Created 10 months ago
526 stars

Top 60.1% on SourcePulse

View on GitHub
Project Summary

Evalchemy is a unified toolkit for evaluating large language models (LLMs), designed for researchers and developers needing to benchmark model performance across a wide range of tasks. It simplifies the process of setting up and running evaluations, offering parallel processing capabilities and standardized results management.

How It Works

Evalchemy builds upon the LM-Eval-Harness, providing a consistent interface for executing diverse benchmarks. It supports data and model parallelism for faster, scalable evaluations, and offers features like local results tracking, optional database integration for leaderboards, and the ability to swap LLM judges. This approach aims to reduce dependency conflicts and streamline the evaluation workflow.
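
As a rough illustration of that interface, a single-benchmark run might look like the sketch below; the eval.eval entry point, flag names, and model identifier are assumptions modeled on the LM-Eval-Harness CLI rather than quoted from the README.

    # Hypothetical single-benchmark run; entry point and flags are assumed to
    # mirror the LM-Eval-Harness CLI -- check the README for exact usage.
    python -m eval.eval \
        --model hf \
        --tasks MTBench \
        --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
        --batch_size 2 \
        --output_path logs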

Quick Start & Requirements

  • Installation: Recommended via Conda (conda create --name evalchemy python=3.10, conda activate evalchemy). Clone the repo, then pip install -e . and pip install -e eval/chat_benchmarks/alpaca_eval; these steps are combined into a single sketch after this list.
  • Prerequisites: Hugging Face login (huggingface-cli login) for datasets/models. CUDA 12.4 is tested; updates may be needed for older versions.
  • Setup Time: Minimal, primarily environment setup.
  • Links: LM Evaluation Harness, Evalchemy Blog Post, Distributed README.
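
Taken together, the setup steps above amount to roughly the following sequence; the clone URL is inferred from the mlfoundations/evalchemy name rather than quoted from the README, and a working CUDA/driver setup is assumed.

    # Environment setup and editable installs, mirroring the bullets above.
    conda create --name evalchemy python=3.10
    conda activate evalchemy
    git clone https://github.com/mlfoundations/evalchemy.git   # URL inferred from the org/repo name
    cd evalchemy
    pip install -e .
    pip install -e eval/chat_benchmarks/alpaca_eval
    huggingface-cli login   # needed for gated datasets and models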

Highlighted Details

  • Supports over 30 benchmarks, including reasoning (AIME25, MATH500), coding (HumanEvalPlus, BigCodeBench), and instruction following (MTBench, AlpacaEval).
  • Integrates with vLLM and Curator for broad model support, including API-based models.
  • Offers distributed evaluation across multiple nodes for significant speedups; a data-parallel launch sketch follows this list.
  • Provides detailed logging of model configuration, seeds, hardware, and timing.
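
For the data-parallel case, a multi-GPU launch could look roughly like the sketch below; the use of accelerate launch and these particular flags are assumptions, and multi-node runs are covered by the Distributed README linked above, which may prescribe different, cluster-specific commands.

    # Hypothetical data-parallel launch over 8 GPUs; eval.eval flags are assumed
    # to match the single-GPU sketch, and accelerate flag names may vary by
    # version -- defer to the Distributed README for multi-node setups.
    accelerate launch --multi_gpu --num_processes 8 --module eval.eval \
        --model hf \
        --tasks AIME25,MATH500,HumanEvalPlus \
        --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
        --batch_size 2 \
        --output_path logs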

Maintenance & Community

Developed by the DataComp community and Bespoke Labs. Contributions are welcomed via standard GitHub workflows. Citation details are provided.

Licensing & Compatibility

The README does not explicitly state a primary license, and the project builds on LM-Eval-Harness and other dependencies with their own licenses. Commercial use or closed-source linking would therefore require a careful review of all component licenses.

Limitations & Caveats

Some benchmarks, like BigCodeBench, require caution due to potential risks from executing LLM-generated code. ZeroEval benchmarks require requesting access to a private Hugging Face dataset and accepting terms. The README notes that some advanced features might require specific HPC cluster configurations.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA

0.8% · 1k stars
Evaluation suite for long-context language models (research paper)
Created 1 year ago · Updated 1 month ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 18 hours ago