Evaluation harness for LLMs trained on code
This repository provides the HumanEval dataset and evaluation harness for assessing the code generation capabilities of large language models. It is designed for researchers and developers working on AI models for programming tasks, enabling standardized benchmarking of model performance on functional correctness.
How It Works
The harness executes model-generated code against a suite of unit tests for each programming problem. It measures functional correctness using metrics like pass@k, which represents the probability that at least one of k generated solutions passes the tests. The process is designed to be secure, with a strong emphasis on running untrusted code inside a sandboxed environment.
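pass@k is typically reported with the unbiased estimator from the Codex paper, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. Below is a minimal sketch of that estimator with illustrative counts; the function name is for illustration and is not the harness's exact API.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k), computed stably.

    n: total samples generated for a problem
    c: samples that passed the unit tests
    k: the k in pass@k (requires n >= k)
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples for a problem, 35 of which pass.
print(pass_at_k(n=200, c=35, k=1))    # ~0.175
print(pass_at_k(n=200, c=35, k=100))  # close to 1.0
```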
Quick Start & Requirements
pip install -e human-eval
evaluate_functional_correctness <your_samples.jsonl>
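The harness expects samples as a JSONL file in which each line pairs a task_id with a model-generated completion. The sketch below shows one way to produce such a file using the package's read_problems and write_jsonl helpers; generate_one_completion is a placeholder for your own model call.

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model and return only the completion text
    # that should follow the prompt (not the prompt itself).
    raise NotImplementedError

problems = read_problems()

num_samples_per_task = 200  # enough samples to estimate up to pass@100
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
```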
Highlighted Details
pass@k metrics for robust evaluation.
Maintenance & Community
This project originates from OpenAI and is associated with the "Evaluating Large Language Models Trained on Code" paper. No specific community channels or active maintenance signals are detailed in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Given its research origin and purpose, it is likely intended for research use; anyone considering commercial use should confirm the license terms in the repository itself.
Limitations & Caveats
The README warns users against running untrusted code outside a robust security sandbox. It also notes a potential malloc error related to RAM limitations, which could cause correct programs to fail. pass@k is not computed when the number of samples is less than k.
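The n >= k requirement follows from the estimator itself: C(n, k) is zero when fewer than k samples exist, so the ratio is undefined. A toy check with illustrative numbers:

```python
from math import comb

n, c, k = 10, 4, 100  # only 10 samples per problem, so pass@100 cannot be estimated
if n < k:
    print(f"pass@{k} skipped: need at least {k} samples per problem")
else:
    print(1 - comb(n - c, k) / comb(n, k))
```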