human-eval by openai

Evaluation harness for LLMs trained on code

created 4 years ago
2,855 stars

Top 17.1% on sourcepulse

Project Summary

This repository provides the HumanEval dataset and evaluation harness for assessing the code generation capabilities of large language models. It is designed for researchers and developers working on AI models for programming tasks, enabling standardized benchmarking of model performance on functional correctness.

How It Works

The harness executes model-generated code against a suite of unit tests for each programming problem and measures functional correctness with the pass@k metric: the probability that at least one of k generated solutions passes all of a problem's tests. Because the generated code is untrusted, the README strongly advises running the evaluation only inside a robust security sandbox.
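
As a sketch of the metric (the harness computes this internally), the unbiased pass@k estimator described in the accompanying paper can be written as follows, where n is the number of samples generated for a problem and c is the number that pass its tests:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator 1 - C(n - c, k) / C(n, k),
        # evaluated in a numerically stable product form.
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 200 samples per task, 30 passing -> pass@1 = 30/200 = 0.15
    print(pass_at_k(n=200, c=30, k=1))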

Quick Start & Requirements

  • Install by cloning the repository and running pip install -e human-eval.
  • Requires Python 3.7+.
  • Usage involves generating code completions in JSON Lines (JSONL) format, one record per sample with task_id and completion fields, and then running evaluate_functional_correctness <your_samples.jsonl> (see the sketch after this list).
  • Official documentation and examples are available within the repository.
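
A minimal sketch of the sampling workflow, following the pattern shown in the repository's README; generate_one_completion is a placeholder for your own model call:

    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # Placeholder: call your model here and return only the code that
        # completes the given function signature.
        raise NotImplementedError

    problems = read_problems()  # task_id -> problem dict with "prompt", "test", ...

    num_samples_per_task = 200
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]
    write_jsonl("samples.jsonl", samples)

    # Then, from the shell:
    #   evaluate_functional_correctness samples.jsonl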

Highlighted Details

  • Evaluates functional correctness of code generation models.
  • Supports pass@k metrics for robust evaluation.
  • Includes a dataset of programming problems with unit tests.
  • Emphasizes secure execution of untrusted model-generated code.

Maintenance & Community

This project originates from OpenAI and is associated with the "Evaluating Large Language Models Trained on Code" paper. No specific community channels or active maintenance signals are detailed in the README.

Licensing & Compatibility

The README does not state a license. Consult the LICENSE file in the repository to confirm the actual terms, including commercial compatibility, rather than assuming research-only or non-commercial restrictions.

Limitations & Caveats

The README warns against running untrusted model-generated code outside a robust security sandbox. It also notes that operating-system memory limits can trigger malloc errors that cause otherwise correct programs to be reported as failing. Finally, pass@k is only computed when at least k samples are provided per problem.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

150 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 4 more.

yet-another-applied-llm-benchmark by carlini

0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
created 1 year ago
updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (Founder of Agentic).

LiveCodeBench by LiveCodeBench

0.8% · 606 stars
Benchmark for holistic LLM code evaluation
created 1 year ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Wei-Lin Chiang (Cofounder of LMArena).

evalplus by evalplus

0.5% · 2k stars
LLM code evaluation framework for rigorous testing
created 2 years ago
updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Travis Fischer (Founder of Agentic), and 3 more.

AlphaCodium by Codium-ai

0.2% · 4k stars
Code generation research paper implementation
created 1 year ago
updated 8 months ago