Evaluation harness for LLMs using the HumanEval benchmark
This repository provides a framework for evaluating Large Language Models (LLMs) on the HumanEval benchmark, specifically focusing on code generation capabilities. It's designed for researchers and developers who need to benchmark and compare the performance of various code-generation models, offering reproducible results and insights into model strengths.
How It Works
The project leverages the HumanEval benchmark to assess LLM performance. It includes scripts tailored for different model architectures and output formats, handling variations in tokenization and loading. A key aspect is the post-processing of model outputs, particularly for base models that might repeat tokens, to ensure accurate benchmark scoring. The evaluation process aims to reproduce official scores where possible, providing a standardized method for comparison.
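As a rough illustration of that flow, the sketch below uses OpenAI's human_eval package to read the benchmark problems, generate one completion per task, and write the results to a JSONL file for later scoring. `generate_one` is a hypothetical placeholder for whatever model-specific script (such as eval_wizard.py) actually produces completions; this is not the repository's exact code.

```python
# Minimal sketch of the generate-then-score flow (not the repository's exact code).
# Assumes OpenAI's human_eval package; generate_one is a hypothetical placeholder.
from human_eval.data import read_problems, write_jsonl

def generate_one(prompt: str) -> str:
    """Placeholder: call the model under evaluation and return its raw completion."""
    raise NotImplementedError

problems = read_problems()  # maps task_id -> {"prompt": ..., "test": ..., ...}
samples = [
    {"task_id": task_id, "completion": generate_one(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)  # scored later with evaluate_functional_correctness
```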
Quick Start & Requirements
- Set up a virtual environment (`python -m venv env && source env/bin/activate`) and install dependencies (`pip install -r requirements.txt`).
- Run a model-specific evaluation script, e.g. `python eval_wizard.py`.
- Use `process_eval.py` for models outputting markdown code (e.g., WizardCoder, OpenCoder).
- Score with `evaluate_functional_correctness` on the processed JSONL files (a scoring sketch via the Python API follows below).
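The scoring step can also be driven from Python. The sketch below assumes the human_eval package's `evaluate_functional_correctness` function (the same logic behind the CLI entry point above); the file name is a placeholder for the output of `process_eval.py`. Note that the human-eval package ships with model-code execution disabled and asks you to enable it manually before scoring will actually run.

```python
# Sketch: scoring a processed JSONL via the human_eval Python API instead of the CLI.
# The path below is a placeholder; pass@k values are returned as a dict.
from human_eval.evaluation import evaluate_functional_correctness

results = evaluate_functional_correctness(
    sample_file="processed_samples.jsonl",  # placeholder for process_eval.py output
    k=[1],                                  # report pass@1 only
    n_workers=4,                            # parallel sandboxed test runs
    timeout=3.0,                            # seconds allowed per test program
)
print(results)  # e.g. {"pass@1": 0.57}
```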
Highlighted Details
- A `filter_code` post-generation step for base models mitigates output repetition; a sketch of the idea follows below.
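The sketch below illustrates the general idea behind such a filter, not the repository's exact implementation: the raw completion is truncated at the first of a set of assumed stop sequences, so repeated or trailing text from a base model never reaches the correctness checker.

```python
# Illustration only: a filter_code-style truncation step.
# The stop sequences are assumptions, not the repository's actual list.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def filter_code_sketch(completion: str) -> str:
    """Cut a raw completion at the earliest stop sequence so repeated
    boilerplate after the target function body is discarded."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```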
Maintenance & Community
- Maintained by abacaj.
Licensing & Compatibility
Limitations & Caveats
The repository is described as a personal utility with code duplication for handling edge cases, indicating potential areas for refactoring. The README also notes that some models require specific post-processing steps to achieve accurate results, implying a need for careful configuration.