Benchmark for evaluating AI agents on machine learning engineering tasks
MLE-bench is a comprehensive benchmark for evaluating AI agents on machine learning engineering tasks. It gives researchers and developers building AI agents a standardized framework for measuring proficiency across a range of ML engineering challenges, with reproducible and comparable performance metrics.
How It Works
MLE-bench comprises 75 Kaggle competitions, each with custom preparation and grading scripts. The benchmark evaluates AI agents by having them attempt these competitions, submitting solutions in CSV format. Submissions are then graded using provided scripts, allowing for quantitative assessment of the agents' ML engineering capabilities. The benchmark supports a "lite" version using a subset of competitions for faster evaluation.
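To make that loop concrete, the sketch below prepares a single competition and grades one CSV submission. Only `mlebench prepare --all` and `mlebench prepare --lite` are covered in this summary; the `-c` flag and the `grade-sample` subcommand are assumptions about the upstream CLI, so verify the exact names with `mlebench --help`. The competition id and `submission.csv` path are placeholders.

```bash
# Prepare the dataset for a single competition (the id is a placeholder).
mlebench prepare -c <competition-id>

# ...the agent trains models and writes its predictions to submission.csv...

# Grade that one CSV submission with the competition's grading script.
# NOTE: subcommand name assumed from the upstream CLI; check `mlebench --help`.
mlebench grade-sample submission.csv <competition-id>
```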
Quick Start & Requirements
- Install the package with `pip install -e .`.
- Fetch the competition data with `git lfs fetch --all` and `git lfs pull`.
- Kaggle API credentials (`~/.kaggle/kaggle.json`) are necessary for data preparation (see the sketch below).
- Prepare competitions with `mlebench prepare --all` (full dataset, ~2 days) or `mlebench prepare --lite` (lite subset).
- A Docker image (`mlebench-env`) is available for a consistent environment.
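The credentials file referenced above follows the standard Kaggle API format. A minimal sketch, with placeholder values for the username and API key:

```bash
# Create the Kaggle API credentials file expected by the data-preparation step.
mkdir -p ~/.kaggle
cat > ~/.kaggle/kaggle.json <<'EOF'
{"username": "your-kaggle-username", "key": "your-kaggle-api-key"}
EOF

# The Kaggle client warns about world-readable credentials, so tighten permissions.
chmod 600 ~/.kaggle/kaggle.json
```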
Highlighted Details
- Docker image (`mlebench-env`) provided for consistent setup.
- 75 Kaggle competitions, each with custom preparation and grading scripts, plus a lite subset for faster evaluation.

Maintenance & Community
The project is associated with authors from institutions including MIT. The primary publication is available on arXiv. Links to community channels or roadmaps are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
Preparing the full dataset can take up to two days. The README notes that reducing compute resources risks degrading agent performance. Some competition tests may fail during `pytest` execution if the corresponding competition rules have not been accepted.
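If failures are limited to competitions whose rules have not been accepted, pytest's standard selection flags can narrow the run; whether the test names embed the competition id is an assumption here, and the id shown is a placeholder.

```bash
# Run the full test suite; tests for competitions whose Kaggle rules
# have not been accepted may fail.
pytest

# Restrict the run to tests whose names match a given competition id
# (assumes the test names include the id; adjust the expression as needed).
pytest -k "<competition-id>"
```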