code-eval by abacaj

Evaluation harness for LLMs using the HumanEval benchmark

Created 2 years ago
419 stars

Top 70.1% on SourcePulse

1 Expert Loves This Project
Project Summary

This repository provides a framework for evaluating Large Language Models (LLMs) on the HumanEval benchmark, specifically focusing on code generation capabilities. It's designed for researchers and developers who need to benchmark and compare the performance of various code-generation models, offering reproducible results and insights into model strengths.

How It Works

The project leverages the HumanEval benchmark to assess LLM performance. It includes scripts tailored for different model architectures and output formats, handling variations in tokenization and loading. A key aspect is the post-processing of model outputs, particularly for base models that might repeat tokens, to ensure accurate benchmark scoring. The evaluation process aims to reproduce official scores where possible, providing a standardized method for comparison.
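
To make the post-processing idea concrete, below is a minimal sketch of what a filter_code-style truncation pass could look like. The stop sequences and the function signature here are illustrative assumptions, not the repository's actual implementation.

    # Illustrative sketch only: the repository's real filter_code may differ.
    # Base models often keep generating past the target function; cutting the
    # completion at the first "stop" marker keeps only the intended solution.

    # Assumed stop markers, loosely modeled on common HumanEval post-processing.
    STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

    def filter_code(completion: str) -> str:
        """Cut a raw completion at the earliest stop sequence, if one appears."""
        cut = len(completion)
        for stop in STOP_SEQUENCES:
            idx = completion.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]

Per the README, this kind of filtering is applied to base-model outputs, which are the ones prone to repeating tokens past the end of the solution.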

Quick Start & Requirements

  • Install: Create a Python virtual environment (python -m venv env && source env/bin/activate) and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.x.
  • Running Eval: Execute model-specific evaluation scripts (e.g., python eval_wizard.py).
  • Processing: Use process_eval.py for models that emit markdown-fenced code (e.g., WizardCoder, OpenCoder); a minimal sketch of this step follows the list.
  • Results: Run evaluate_functional_correctness with processed JSONL files.
  • Docs: No separate documentation is linked; the per-model evaluation scripts serve as usage examples.
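
As referenced in the Processing step above, the sketch below shows what extracting code from markdown-fenced completions could look like. The regex, the JSONL field names, and the file names are assumptions for illustration and may not match process_eval.py.

    # Illustrative sketch only: process_eval.py in the repository may differ.
    # Models such as WizardCoder wrap their answers in markdown fences; stripping
    # the fences leaves raw Python that evaluate_functional_correctness can run.
    import json
    import re

    FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)

    def extract_code(completion: str) -> str:
        """Return the first fenced code block, or the completion unchanged."""
        match = FENCE_RE.search(completion)
        return match.group(1) if match else completion

    def process_file(in_path: str, out_path: str) -> None:
        """Rewrite a completions JSONL so each 'completion' holds plain code."""
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                sample = json.loads(line)
                sample["completion"] = extract_code(sample["completion"])
                fout.write(json.dumps(sample) + "\n")

    if __name__ == "__main__":
        # Hypothetical file names for illustration.
        process_file("eval_wizard_raw.jsonl", "eval_wizard_processed.jsonl")

The processed JSONL can then be scored with evaluate_functional_correctness, as noted in the Results step.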

Highlighted Details

  • Provides a sorted results table with pass@1 and pass@10 scores for various models (the pass@k estimator is sketched after this list).
  • Addresses score discrepancies by attempting to reproduce official evaluation prompts and processing.
  • Includes a filter_code post-generation step for base models to mitigate output repetition.
  • Scripts are adapted from the WizardCoder repository.
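
For reference, pass@k scores of this kind are normally computed with the unbiased estimator from the HumanEval paper: with n generated samples per problem, c of which pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A small Python version in the numerically stable product form is sketched below; it is shown for reference and is not taken from this repository.

    # Unbiased pass@k estimator from the HumanEval paper, in the numerically
    # stable product form.
    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Estimate pass@k for one problem with n samples, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k subset contains at least one correct sample
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

Averaging this quantity over the 164 HumanEval problems yields the reported pass@1 and pass@10 numbers.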

Maintenance & Community

  • The repository is maintained by abacaj.
  • No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The repository is described as a personal utility with code duplication for handling edge cases, indicating potential areas for refactoring. The README also notes that some models require specific post-processing steps to achieve accurate results, implying a need for careful configuration.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai · 0.4% · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago · Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench · 2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 20 hours ago