human-eval by openai

Evaluation harness for LLMs trained on code

Created 4 years ago
2,933 stars

Top 16.2% on SourcePulse

View on GitHub
Project Summary

This repository provides the HumanEval dataset and evaluation harness for assessing the code generation capabilities of large language models. It is designed for researchers and developers working on AI models for programming tasks, enabling standardized benchmarking of model performance on functional correctness.

How It Works

The harness executes model-generated code against a suite of unit tests for each programming problem. It measures functional correctness using metrics like pass@k, which represents the probability that at least one of k generated solutions passes the tests. The evaluation process is designed to be secure, with a strong emphasis on running untrusted code within a sandboxed environment.
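
For intuition, here is a minimal sketch of the unbiased pass@k estimator described in the associated paper, where n samples are generated per problem and c of them pass the tests; the function name and example numbers are illustrative rather than the harness's exact API.

```python
# Minimal sketch of the unbiased pass@k estimator (illustrative, not the
# harness's exact API): pass@k = 1 - C(n - c, k) / C(n, k) per problem.
import numpy as np

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem (requires n >= k)
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # Compute 1 - C(n - c, k) / C(n, k) as a product for numerical stability.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of which pass.
print([round(estimate_pass_at_k(200, 37, k), 4) for k in (1, 10, 100)])
```

For k = 1 this reduces to c / n, the fraction of passing samples; the overall benchmark score averages the per-problem estimates.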

Quick Start & Requirements

  • Install by cloning the repository and running pip install -e human-eval.
  • Requires Python 3.7+.
  • Usage involves generating code completions in JSON Lines format and then running evaluate_functional_correctness <your_samples.jsonl>; see the sketch after this list.
  • Official documentation and examples are available within the repository.
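
Below is a minimal sketch of the sample-generation step, assuming the read_problems and write_jsonl helpers from human_eval.data shown in the repository's README; generate_one_completion is a placeholder for your own model call.

```python
# Sketch of generating samples for evaluation (assumes human_eval.data helpers).
from human_eval.data import write_jsonl, read_problems

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model here and return only the code completion.
    raise NotImplementedError

problems = read_problems()  # task_id -> problem dict containing a "prompt"

num_samples_per_task = 200  # generate at least k samples per task for pass@k
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
# Then score with: evaluate_functional_correctness samples.jsonl
```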

Highlighted Details

  • Evaluates functional correctness of code generation models.
  • Supports pass@k metrics for robust evaluation.
  • Includes a dataset of programming problems with unit tests.
  • Emphasizes secure execution of untrusted model-generated code.

Maintenance & Community

This project originates from OpenAI and is associated with the "Evaluating Large Language Models Trained on Code" paper. No specific community channels or active maintenance signals are detailed in the README.

Licensing & Compatibility

The license is not explicitly stated in the provided README; consult the repository's LICENSE file to confirm terms before assuming research-only or commercial compatibility.

Limitations & Caveats

The README warns against executing untrusted model-generated code outside a robust security sandbox. It also notes that memory limits can trigger malloc errors that cause otherwise-correct programs to fail, and that pass@k is not computed when fewer than k samples are provided per problem.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 54 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago · Updated 4 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

AlphaCodium by Codium-ai

0.1% · 4k stars
Code generation research paper implementation
Created 1 year ago · Updated 9 months ago