code-eval by abacaj

Evaluation harness for LLMs using the HumanEval benchmark

created 2 years ago
416 stars

Top 71.5% on sourcepulse

Project Summary

This repository provides a framework for evaluating Large Language Models (LLMs) on the HumanEval benchmark, specifically focusing on code generation capabilities. It's designed for researchers and developers who need to benchmark and compare the performance of various code-generation models, offering reproducible results and insights into model strengths.

How It Works

The project leverages the HumanEval benchmark to assess LLM performance. It includes scripts tailored for different model architectures and output formats, handling variations in tokenization and loading. A key aspect is the post-processing of model outputs, particularly for base models that might repeat tokens, to ensure accurate benchmark scoring. The evaluation process aims to reproduce official scores where possible, providing a standardized method for comparison.
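
The repository's model-specific scripts are not reproduced in the README, but the general generate-then-score loop can be sketched as follows. This is a minimal illustration assuming OpenAI's human-eval package and a Hugging Face transformers causal LM; the model name is a placeholder, and the actual scripts handle per-model prompts, batching, and post-processing.

    # Sketch of the generate-then-score flow; the repo's per-model scripts
    # differ in prompt formatting, batching, and post-processing.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from human_eval.data import read_problems, write_jsonl

    model_name = "your/code-model"  # placeholder model id
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    problems = read_problems()  # {task_id: {"prompt": ..., ...}}
    samples = []
    for task_id, problem in problems.items():
        inputs = tokenizer(problem["prompt"], return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        # keep only the generated continuation, not the echoed prompt
        completion = text[len(problem["prompt"]):]
        samples.append({"task_id": task_id, "completion": completion})

    write_jsonl("samples.jsonl", samples)
    # then score with: evaluate_functional_correctness samples.jsonl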

Quick Start & Requirements

  • Install: Create a Python virtual environment (python -m venv env && source env/bin/activate) and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.x.
  • Running Eval: Execute model-specific evaluation scripts (e.g., python eval_wizard.py).
  • Processing: Use process_eval.py for models that emit markdown-fenced code (e.g., WizardCoder, OpenCoder); see the sketch after this list.
  • Results: Run evaluate_functional_correctness with processed JSONL files.
  • Docs: No separate documentation is linked; the evaluation scripts themselves serve as usage examples.
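
The exact logic of process_eval.py is not shown in the README. As a rough illustration of the markdown-stripping step it performs, a sketch (assuming completions are stored in a JSONL "completion" field, with hypothetical input/output filenames) could look like this:

    # Rough sketch of stripping markdown fences from model completions
    # before scoring; process_eval.py's actual handling may differ.
    import json
    import re

    FENCE_RE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

    def extract_code(completion: str) -> str:
        """Return the first fenced code block, or the raw text if none is found."""
        match = FENCE_RE.search(completion)
        return match.group(1) if match else completion

    with open("samples.jsonl") as f_in, open("processed.jsonl", "w") as f_out:
        for line in f_in:
            sample = json.loads(line)
            sample["completion"] = extract_code(sample["completion"])
            f_out.write(json.dumps(sample) + "\n")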

Highlighted Details

  • Provides a sorted results table with pass@1 and pass@10 scores for various models.
  • Addresses score discrepancies by attempting to reproduce official evaluation prompts and processing.
  • Includes a filter_code post-generation step for base models to mitigate output repetition (illustrated after this list).
  • Scripts are adapted from the WizardCoder repository.
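
The filter_code helper itself is not documented in the README. The general idea of truncating a base model's raw completion at common stop sequences to curb repetition can be sketched as below; the stop sequences and the function name here are illustrative, not the repo's actual implementation.

    # Illustrative truncation of a base model's completion at common stop
    # sequences; the repo's filter_code helper may use different rules.
    STOP_SEQUENCES = ["\nclass ", "\ndef ", "\nif __name__", "\nprint("]

    def filter_code_sketch(completion: str) -> str:
        """Cut the completion at the earliest stop sequence to curb repetition."""
        cut = len(completion)
        for stop in STOP_SEQUENCES:
            idx = completion.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]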

Maintenance & Community

  • The repository is maintained by abacaj.
  • No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The repository is described as a personal utility with code duplication for handling edge cases, indicating potential areas for refactoring. The README also notes that some models require specific post-processing steps to achieve accurate results, implying a need for careful configuration.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
