mle-bench by openai

Benchmark for evaluating AI agents on machine learning engineering tasks

created 9 months ago
819 stars

Top 44.2% on sourcepulse

Project Summary

MLE-bench provides a comprehensive benchmark for evaluating AI agents on machine learning engineering tasks. It targets researchers and developers building AI agents, offering a standardized framework for measuring proficiency across a range of ML engineering challenges with reproducible, comparable performance metrics.

How It Works

MLE-bench comprises 75 Kaggle competitions, each with custom preparation and grading scripts. The benchmark evaluates AI agents by having them attempt these competitions, submitting solutions in CSV format. Submissions are then graded using provided scripts, allowing for quantitative assessment of the agents' ML engineering capabilities. The benchmark supports a "lite" version using a subset of competitions for faster evaluation.
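
Concretely, the per-competition flow looks roughly like the sketch below. It is a hedged illustration, not the project's documented workflow: the prepare -c flag, the grade-sample subcommand, and the <competition-id> placeholder are assumptions, so consult mlebench --help for the exact names.

    # Prepare data for a single competition (placeholder id; the -c flag is an assumption)
    mlebench prepare -c <competition-id>

    # An agent works on the task and writes its predictions to a CSV file,
    # e.g. submission.csv, in the format the competition expects.

    # Grade the CSV against the competition's answers (subcommand name is an assumption)
    mlebench grade-sample submission.csv <competition-id>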

Quick Start & Requirements

  • Install with pip install -e .
  • Requires Git LFS: git lfs fetch --all and git lfs pull
  • Kaggle API credentials (~/.kaggle/kaggle.json) are necessary for data preparation.
  • Dataset preparation: mlebench prepare --all (full dataset, ~2 days) or mlebench prepare --lite (lite dataset); a consolidated setup sequence is sketched after this list.
  • Docker image mlebench-env is available for a consistent environment.
  • Recommended evaluation resources: 24 hours runtime, 36 vCPUs, 440GB RAM, one 24GB A10 GPU.
  • Official documentation and examples are available within the repository.
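
Taken together, the steps above amount to roughly the following sequence. This is a sketch rather than an official quick start: the repository URL is inferred from the project and author names, and Kaggle credentials are assumed to exist already.

    # Fetch the repository and its Git LFS artifacts
    git clone https://github.com/openai/mle-bench.git
    cd mle-bench
    git lfs fetch --all
    git lfs pull

    # Install the package in editable mode
    pip install -e .

    # Kaggle API credentials are expected at ~/.kaggle/kaggle.json

    # Prepare the lite subset (use --all for the full dataset, ~2 days)
    mlebench prepare --lite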

Highlighted Details

  • Evaluates AI agents on 75 Kaggle ML engineering competitions.
  • Includes custom data preparation and grading scripts for each competition.
  • Supports a "lite" evaluation set for reduced computational cost.
  • Provides a Docker environment (mlebench-env) for consistent setup; a build-and-run sketch follows this list.
  • Offers optional extras like rule and plagiarism detectors.
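
For the Docker environment mentioned above, a typical build-and-run sequence might look like the following sketch. Only the image name mlebench-env comes from the README; the Dockerfile path is an assumption.

    # Build the agent environment image (Dockerfile path is an assumption)
    docker build -t mlebench-env -f environment/Dockerfile .

    # Start an interactive shell inside the environment
    docker run -it --rm mlebench-env bash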

Maintenance & Community

The project is associated with authors from institutions including MIT. The primary publication is available on arXiv. Links to community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Full dataset preparation can take up to two days. The README notes that reducing the recommended compute resources may degrade agent performance. Some tests run via pytest may fail if the corresponding Kaggle competition rules have not been accepted by the local account.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2

Star History

136 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley).

SWE-Gym by SWE-Gym

  • Environment for training software engineering agents
  • 513 stars; top 1.0% on sourcepulse
  • Created 9 months ago; updated 4 days ago