human-eval by openai

Evaluation harness for LLMs trained on code

created 4 years ago
2,855 stars

Top 17.1% on sourcepulse

Project Summary

This repository provides the HumanEval dataset and evaluation harness for assessing the code generation capabilities of large language models. It is designed for researchers and developers working on AI models for programming tasks, enabling standardized benchmarking of model performance on functional correctness.

How It Works

The harness executes model-generated code against a suite of unit tests for each programming problem and measures functional correctness with the pass@k metric: the probability that at least one of k generated solutions passes all of a problem's tests. Because the generated code is untrusted, the README strongly advises running the evaluation only inside a robust security sandbox.
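
As a sketch of the metric (the harness computes this internally), the unbiased pass@k estimator described in the accompanying paper can be written as follows, where n is the number of samples generated for a problem and c is the number that pass its tests:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator 1 - C(n - c, k) / C(n, k),
        # evaluated in a numerically stable product form.
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 200 samples per task, 30 passing -> pass@1 = 30/200 = 0.15
    print(pass_at_k(n=200, c=30, k=1))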

Quick Start & Requirements

  • Install by cloning the repository and running pip install -e human-eval.
  • Requires Python 3.7+.
  • Usage involves generating code completions in JSON Lines (JSONL) format, one record per sample with task_id and completion fields, and then running evaluate_functional_correctness <your_samples.jsonl> (see the sketch after this list).
  • Official documentation and examples are available within the repository.
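
A minimal sketch of the sampling workflow, following the pattern shown in the repository's README; generate_one_completion is a placeholder for your own model call:

    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # Placeholder: call your model here and return only the code that
        # completes the given function signature.
        raise NotImplementedError

    problems = read_problems()  # task_id -> problem dict with "prompt", "test", ...

    num_samples_per_task = 200
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]
    write_jsonl("samples.jsonl", samples)

    # Then, from the shell:
    #   evaluate_functional_correctness samples.jsonl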

Highlighted Details

  • Evaluates functional correctness of code generation models.
  • Supports pass@k metrics for robust evaluation.
  • Includes a dataset of programming problems with unit tests.
  • Emphasizes secure execution of untrusted model-generated code.

Maintenance & Community

This project originates from OpenAI and is associated with the "Evaluating Large Language Models Trained on Code" paper. No specific community channels or active maintenance signals are detailed in the README.

Licensing & Compatibility

The README does not state a license. Consult the LICENSE file in the repository to confirm the actual terms, including commercial compatibility, rather than assuming research-only or non-commercial restrictions.

Limitations & Caveats

The README warns against running untrusted model-generated code outside a robust security sandbox. It also notes that operating-system memory limits can trigger malloc errors that cause otherwise correct programs to be reported as failing. Finally, pass@k is only computed when at least k samples are provided per problem.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

150 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 4 more.

yet-another-applied-llm-benchmark by carlini

0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
created 1 year ago
updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (Founder of Agentic).

LiveCodeBench by LiveCodeBench

0.8% · 606 stars
Benchmark for holistic LLM code evaluation
created 1 year ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Wei-Lin Chiang (Cofounder of LMArena).

evalplus by evalplus

0.5% · 2k stars
LLM code evaluation framework for rigorous testing
created 2 years ago
updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Travis Fischer (Founder of Agentic), and 3 more.

AlphaCodium by Codium-ai

0.2% · 4k stars
Code generation research paper implementation
created 1 year ago
updated 8 months ago