LLM evaluation toolkit for post-trained language models
Evalchemy is a unified toolkit for evaluating large language models (LLMs), designed for researchers and developers needing to benchmark model performance across a wide range of tasks. It simplifies the process of setting up and running evaluations, offering parallel processing capabilities and standardized results management.
How It Works
Evalchemy builds upon the LM-Eval-Harness, providing a consistent interface for executing diverse benchmarks. It supports data and model parallelism for faster, scalable evaluations, and offers features like local results tracking, optional database integration for leaderboards, and the ability to swap LLM judges. This approach aims to reduce dependency conflicts and streamline the evaluation workflow.
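Because the interface follows LM-Eval-Harness conventions, a benchmark run is typically a single command. The sketch below is illustrative only: the eval.eval entry point, the lm-eval-harness-style flags (--model, --tasks, --model_args, --batch_size, --output_path), the task names, and the model identifier are assumptions based on that lineage rather than details confirmed by this summary; consult the repository README for the exact invocation.

```bash
# Illustrative sketch of an evaluation run; flags and task names are assumed
# from lm-eval-harness conventions and should be verified against the repo.
python -m eval.eval \
    --model hf \
    --tasks HumanEval,alpaca_eval \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
    --batch_size 16 \
    --output_path logs
```

Results are written under the given output path, which is what the local results tracking and optional leaderboard database build on.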
Quick Start & Requirements
Create and activate a conda environment (conda create --name evalchemy python=3.10, then conda activate evalchemy). Clone the repository, then install with pip install -e . followed by pip install -e eval/chat_benchmarks/alpaca_eval. Log in to Hugging Face (huggingface-cli login) to access gated datasets and models. CUDA 12.4 is the tested configuration; older CUDA versions may require dependency updates.
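For the data-parallel path mentioned above, multi-GPU runs are commonly driven through Hugging Face accelerate. The command below is a hedged sketch, not a verbatim excerpt from the project docs: the accelerate launch wrapper, the --num-processes value, and the -m eval.eval entry point are assumptions carried over from LM-Eval-Harness-style tooling.

```bash
# Hedged sketch of a data-parallel run on a single 8-GPU node
# (accelerate flags and entry point assumed; check the repo README).
accelerate launch --num-processes 8 -m eval.eval \
    --model hf \
    --tasks HumanEval \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
    --batch_size 8 \
    --output_path logs
```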
Maintenance & Community
Developed by the DataComp community and Bespoke Labs. Contributions are welcomed via standard GitHub workflows. Citation details are provided.
Licensing & Compatibility
The README does not explicitly state a license, and the project depends on LM-Eval-Harness and other third-party components. Commercial use or closed-source linking would therefore require a careful review of all component licenses.
Limitations & Caveats
Some benchmarks, like BigCodeBench, require caution due to potential risks from executing LLM-generated code. ZeroEval benchmarks require requesting access to a private Hugging Face dataset and accepting terms. The README notes that some advanced features might require specific HPC cluster configurations.