Evaluation harness for LLMs using the HumanEval benchmark
This repository provides a framework for evaluating Large Language Models (LLMs) on the HumanEval benchmark, specifically focusing on code generation capabilities. It's designed for researchers and developers who need to benchmark and compare the performance of various code-generation models, offering reproducible results and insights into model strengths.
How It Works
The project leverages the HumanEval benchmark to assess LLM performance. It includes scripts tailored for different model architectures and output formats, handling variations in tokenization and loading. A key aspect is the post-processing of model outputs, particularly for base models that might repeat tokens, to ensure accurate benchmark scoring. The evaluation process aims to reproduce official scores where possible, providing a standardized method for comparison.
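As a rough illustration of that flow, the sketch below uses OpenAI's human_eval package to read the benchmark problems, generate one completion per task, and write the results to a JSONL file for later scoring. `generate_one` is a hypothetical placeholder for whatever model-specific script (such as eval_wizard.py) actually produces completions; this is not the repository's exact code.

```python
# Minimal sketch of the generate-then-score flow (not the repository's exact code).
# Assumes OpenAI's human_eval package; generate_one is a hypothetical placeholder.
from human_eval.data import read_problems, write_jsonl

def generate_one(prompt: str) -> str:
    """Placeholder: call the model under evaluation and return its raw completion."""
    raise NotImplementedError

problems = read_problems()  # maps task_id -> {"prompt": ..., "test": ..., ...}
samples = [
    {"task_id": task_id, "completion": generate_one(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)  # scored later with evaluate_functional_correctness
```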
Quick Start & Requirements
- Set up a virtual environment (`python -m venv env && source env/bin/activate`) and install dependencies (`pip install -r requirements.txt`).
- Run a model-specific evaluation script, e.g. `python eval_wizard.py`.
- Use `process_eval.py` for models outputting markdown code (e.g., WizardCoder, OpenCoder).
- Score with `evaluate_functional_correctness` on the processed JSONL files (a scoring sketch via the Python API follows below).
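The scoring step can also be driven from Python. The sketch below assumes the human_eval package's `evaluate_functional_correctness` function (the same logic behind the CLI entry point above); the file name is a placeholder for the output of `process_eval.py`. Note that the human-eval package ships with model-code execution disabled and asks you to enable it manually before scoring will actually run.

```python
# Sketch: scoring a processed JSONL via the human_eval Python API instead of the CLI.
# The path below is a placeholder; pass@k values are returned as a dict.
from human_eval.evaluation import evaluate_functional_correctness

results = evaluate_functional_correctness(
    sample_file="processed_samples.jsonl",  # placeholder for process_eval.py output
    k=[1],                                  # report pass@1 only
    n_workers=4,                            # parallel sandboxed test runs
    timeout=3.0,                            # seconds allowed per test program
)
print(results)  # e.g. {"pass@1": 0.57}
```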
Highlighted Details
- A `filter_code` post-generation step for base models mitigates output repetition; a sketch of the idea follows below.
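The sketch below illustrates the general idea behind such a filter, not the repository's exact implementation: the raw completion is truncated at the first of a set of assumed stop sequences, so repeated or trailing text from a base model never reaches the correctness checker.

```python
# Illustration only: a filter_code-style truncation step.
# The stop sequences are assumptions, not the repository's actual list.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def filter_code_sketch(completion: str) -> str:
    """Cut a raw completion at the earliest stop sequence so repeated
    boilerplate after the target function body is discarded."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```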
Maintenance & Community
- Maintained by abacaj.
Licensing & Compatibility
Limitations & Caveats
The repository is described as a personal utility with code duplication for handling edge cases, indicating potential areas for refactoring. The README also notes that some models require specific post-processing steps to achieve accurate results, implying a need for careful configuration.