Benchmark for evaluating LLMs/agents on machine learning tasks using repository-level code
ML-Bench provides a framework for evaluating Large Language Models (LLMs) and agents on repository-level code tasks, targeting ML researchers and developers. It aims to benchmark LLM performance in understanding and generating code within the context of entire software repositories.
How It Works
ML-Bench utilizes a dataset derived from GitHub repositories, featuring tasks that require LLMs to generate code snippets based on descriptions, retrieved information, or "oracle" code segments. The framework includes scripts for data preparation, running evaluations against LLMs (including OpenAI models and open-source alternatives like CodeLlama), and fine-tuning models. It supports processing READMEs of varying lengths (up to 128k tokens) to simulate real-world code understanding scenarios.
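The following minimal sketch illustrates that flow under stated assumptions: it builds a prompt from a repository README and a task instruction, queries an LLM, and scores the generated snippet. The helper `call_llm`, the `build_prompt` function, and the record fields `readme`, `instruction`, and `output` are hypothetical placeholders, and exact string matching stands in for the benchmark's actual metrics.

```python
from typing import Callable

def build_prompt(readme: str, instruction: str, max_chars: int = 500_000) -> str:
    # Truncate very long READMEs; ML-Bench supports contexts up to 128k tokens,
    # and 500k characters is only a rough character-level proxy for that limit.
    return f"Repository README:\n{readme[:max_chars]}\n\nTask:\n{instruction}\n\nCode:"

def evaluate(tasks: list[dict], call_llm: Callable[[str], str]) -> float:
    # Fraction of tasks whose generated snippet matches the reference output.
    # Exact string matching is a placeholder for ML-Bench's real scoring.
    correct = 0
    for task in tasks:
        prompt = build_prompt(task["readme"], task["instruction"])
        generated = call_llm(prompt)
        correct += int(generated.strip() == task["output"].strip())
    return correct / max(len(tasks), 1)
```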
Quick Start & Requirements
Clone and install with `git clone --recurse-submodules https://github.com/gersteinlab/ML-Bench.git`, followed by `pip install -r requirements.txt`. The benchmark data is distributed through the Hugging Face `datasets` library and loaded with `datasets.load_dataset("super-dainiu/ml-bench")`; post-processing scripts (`scripts/post_process/prepare.sh`) are required for benchmark generation. Docker is recommended for environment setup: `docker pull public.ecr.aws/i5g0m1f6/ml-bench`, then `docker run -it -v ML_Bench:/deep_data public.ecr.aws/i5g0m1f6/ml-bench /bin/bash`. Model weights are downloaded with `bash utils/download_model_weight_pics.sh` (approx. 2 hours).
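As a quick sanity check, the sketch below (assuming only that the `datasets` package is installed and that the dataset ID above is valid) loads the benchmark from the Hugging Face Hub and prints the fields of a single record without assuming a particular schema.

```python
from datasets import load_dataset

# Load ML-Bench from the Hugging Face Hub (dataset ID taken from the Quick Start).
dataset = load_dataset("super-dainiu/ml-bench")
print(dataset)  # show the available splits and their sizes

# Peek at the first record of the first split without assuming any field names.
first_split = next(iter(dataset))
example = dataset[first_split][0]
for key, value in example.items():
    print(f"{key}: {str(value)[:120]}")
```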
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats