Benchmark for evaluating LLMs/agents on machine learning tasks using repository-level code
ML-Bench provides a framework for evaluating Large Language Models (LLMs) and agents on repository-level code tasks, targeting ML researchers and developers. It aims to benchmark LLM performance in understanding and generating code within the context of entire software repositories.
How It Works
ML-Bench utilizes a dataset derived from GitHub repositories, featuring tasks that require LLMs to generate code snippets based on descriptions, retrieved information, or "oracle" code segments. The framework includes scripts for data preparation, running evaluations against LLMs (including OpenAI models and open-source alternatives like CodeLlama), and fine-tuning models. It supports processing READMEs of varying lengths (up to 128k tokens) to simulate real-world code understanding scenarios.
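The following minimal sketch illustrates that flow under stated assumptions: it builds a prompt from a repository README and a task instruction, queries an LLM, and scores the generated snippet. The helper `call_llm`, the `build_prompt` function, and the record fields `readme`, `instruction`, and `output` are hypothetical placeholders, and exact string matching stands in for the benchmark's actual metrics.

```python
from typing import Callable

def build_prompt(readme: str, instruction: str, max_chars: int = 500_000) -> str:
    # Truncate very long READMEs; ML-Bench supports contexts up to 128k tokens,
    # and 500k characters is only a rough character-level proxy for that limit.
    return f"Repository README:\n{readme[:max_chars]}\n\nTask:\n{instruction}\n\nCode:"

def evaluate(tasks: list[dict], call_llm: Callable[[str], str]) -> float:
    # Fraction of tasks whose generated snippet matches the reference output.
    # Exact string matching is a placeholder for ML-Bench's real scoring.
    correct = 0
    for task in tasks:
        prompt = build_prompt(task["readme"], task["instruction"])
        generated = call_llm(prompt)
        correct += int(generated.strip() == task["output"].strip())
    return correct / max(len(tasks), 1)
```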
Quick Start & Requirements
Clone and install with `git clone --recurse-submodules https://github.com/gersteinlab/ML-Bench.git`, followed by `pip install -r requirements.txt`. The benchmark data is distributed through the Hugging Face `datasets` library and loaded with `datasets.load_dataset("super-dainiu/ml-bench")`; post-processing scripts (`scripts/post_process/prepare.sh`) are required for benchmark generation. Docker is recommended for environment setup: `docker pull public.ecr.aws/i5g0m1f6/ml-bench`, then `docker run -it -v ML_Bench:/deep_data public.ecr.aws/i5g0m1f6/ml-bench /bin/bash`. Model weights are downloaded with `bash utils/download_model_weight_pics.sh` (approx. 2 hours).
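As a quick sanity check, the sketch below (assuming only that the `datasets` package is installed and that the dataset ID above is valid) loads the benchmark from the Hugging Face Hub and prints the fields of a single record without assuming a particular schema.

```python
from datasets import load_dataset

# Load ML-Bench from the Hugging Face Hub (dataset ID taken from the Quick Start).
dataset = load_dataset("super-dainiu/ml-bench")
print(dataset)  # show the available splits and their sizes

# Peek at the first record of the first split without assuming any field names.
first_split = next(iter(dataset))
example = dataset[first_split][0]
for key, value in example.items():
    print(f"{key}: {str(value)[:120]}")
```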
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats