mle-bench by openai

Benchmark for evaluating AI agents on machine learning engineering tasks

created 9 months ago
819 stars

Top 44.2% on sourcepulse

Project Summary

MLE-bench provides a comprehensive benchmark for evaluating AI agents on machine learning engineering tasks. It targets researchers and developers building AI agents, offering a standardized framework for measuring proficiency across a range of ML engineering challenges with reproducible, comparable performance metrics.

How It Works

MLE-bench comprises 75 Kaggle competitions, each with custom preparation and grading scripts. The benchmark evaluates AI agents by having them attempt these competitions, submitting solutions in CSV format. Submissions are then graded using provided scripts, allowing for quantitative assessment of the agents' ML engineering capabilities. The benchmark supports a "lite" version using a subset of competitions for faster evaluation.
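
Concretely, the per-competition flow looks roughly like the sketch below. It is a hedged illustration, not the project's documented workflow: the prepare -c flag, the grade-sample subcommand, and the <competition-id> placeholder are assumptions, so consult mlebench --help for the exact names.

    # Prepare data for a single competition (placeholder id; the -c flag is an assumption)
    mlebench prepare -c <competition-id>

    # An agent works on the task and writes its predictions to a CSV file,
    # e.g. submission.csv, in the format the competition expects.

    # Grade the CSV against the competition's answers (subcommand name is an assumption)
    mlebench grade-sample submission.csv <competition-id>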

Quick Start & Requirements

  • Install with pip install -e .
  • Requires Git LFS: git lfs fetch --all and git lfs pull
  • Kaggle API credentials (~/.kaggle/kaggle.json) are necessary for data preparation.
  • Dataset preparation: mlebench prepare --all (full dataset, ~2 days) or mlebench prepare --lite (lite dataset); a consolidated setup sequence is sketched after this list.
  • Docker image mlebench-env is available for a consistent environment.
  • Recommended evaluation resources: 24 hours runtime, 36 vCPUs, 440GB RAM, one 24GB A10 GPU.
  • Official documentation and examples are available within the repository.
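
Taken together, the steps above amount to roughly the following sequence. This is a sketch rather than an official quick start: the repository URL is inferred from the project and author names, and Kaggle credentials are assumed to exist already.

    # Fetch the repository and its Git LFS artifacts
    git clone https://github.com/openai/mle-bench.git
    cd mle-bench
    git lfs fetch --all
    git lfs pull

    # Install the package in editable mode
    pip install -e .

    # Kaggle API credentials are expected at ~/.kaggle/kaggle.json

    # Prepare the lite subset (use --all for the full dataset, ~2 days)
    mlebench prepare --lite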

Highlighted Details

  • Evaluates AI agents on 75 Kaggle ML engineering competitions.
  • Includes custom data preparation and grading scripts for each competition.
  • Supports a "lite" evaluation set for reduced computational cost.
  • Provides a Docker environment (mlebench-env) for consistent setup; a build-and-run sketch follows this list.
  • Offers optional extras like rule and plagiarism detectors.
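
For the Docker environment mentioned above, a typical build-and-run sequence might look like the following sketch. Only the image name mlebench-env comes from the README; the Dockerfile path is an assumption.

    # Build the agent environment image (Dockerfile path is an assumption)
    docker build -t mlebench-env -f environment/Dockerfile .

    # Start an interactive shell inside the environment
    docker run -it --rm mlebench-env bash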

Maintenance & Community

The project is associated with authors from institutions including MIT. The primary publication is available on arXiv. Links to community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Full dataset preparation can take up to two days. The README notes that reducing the recommended compute resources may degrade agent performance. Some tests run via pytest may fail if the corresponding Kaggle competition rules have not been accepted by the local account.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2

Star History

136 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley).

SWE-Gym by SWE-Gym

  • Environment for training software engineering agents
  • 513 stars; top 1.0% on sourcepulse
  • Created 9 months ago; updated 4 days ago