ML-Bench by gersteinlab

Benchmark for evaluating LLMs/agents on machine learning tasks using repository-level code

Created 1 year ago · 302 stars · Top 89.3% on sourcepulse

Project Summary

ML-Bench provides a framework for evaluating Large Language Models (LLMs) and agents on repository-level code tasks, targeting ML researchers and developers. It aims to benchmark LLM performance in understanding and generating code within the context of entire software repositories.

How It Works

ML-Bench utilizes a dataset derived from GitHub repositories, featuring tasks that require LLMs to generate code snippets based on descriptions, retrieved information, or "oracle" code segments. The framework includes scripts for data preparation, running evaluations against LLMs (including OpenAI models and open-source alternatives like CodeLlama), and fine-tuning models. It supports processing READMEs of varying lengths (up to 128k tokens) to simulate real-world code understanding scenarios.
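
As a rough illustration of the task format, a single evaluation step might look like the sketch below. The field names (instruction, oracle_segment, output) and the call_llm helper are assumptions for illustration, not the repository's actual interface:

    # Hypothetical single evaluation step; field names and call_llm()
    # are illustrative assumptions, not ML-Bench's actual API.
    def evaluate_example(example, call_llm):
        # Combine repository context (a README excerpt or "oracle" code
        # segment) with the natural-language task instruction.
        prompt = (
            f"Repository context:\n{example['oracle_segment']}\n\n"
            f"Task: {example['instruction']}\n"
            "Write the code or shell command that accomplishes this task."
        )
        generated = call_llm(prompt)  # e.g. an OpenAI or CodeLlama call
        # Exact match shown for simplicity; the benchmark's own scoring
        # may be more tolerant (e.g. parameter-level comparison).
        return generated.strip() == example["output"].strip()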

Quick Start & Requirements

  • Install: git clone --recurse-submodules https://github.com/gersteinlab/ML-Bench.git followed by pip install -r requirements.txt.
  • Prerequisites: Python, datasets library. Docker is recommended for environment setup.
  • Data: Load via datasets.load_dataset("super-dainiu/ml-bench") (a minimal loading sketch follows this list). Post-processing scripts (scripts/post_process/prepare.sh) are required for benchmark generation.
  • Docker: docker pull public.ecr.aws/i5g0m1f6/ml-bench and docker run -it -v ML_Bench:/deep_data public.ecr.aws/i5g0m1f6/ml-bench /bin/bash.
  • Model Weights: bash utils/download_model_weight_pics.sh (approx. 2 hours).
  • Links: Dataset, OpenAI Script, Fine-tuning, ML-Agent-Bench.
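
A minimal loading sketch, assuming the dataset follows the standard Hugging Face layout; it inspects the splits and record schema rather than hard-coding field names:

    # Minimal sketch: load the benchmark data and inspect its schema.
    from datasets import load_dataset

    ds = load_dataset("super-dainiu/ml-bench")
    print(ds)                     # list the available splits
    first_split = next(iter(ds))  # take whichever split exists
    example = ds[first_split][0]
    print(example.keys())         # check field names before relying on them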

Highlighted Details

  • Evaluates LLMs on repository-level code tasks under three context settings: task description only, retrieved context, and gold "oracle" segments (sketched after this list).
  • Supports fine-tuning of open-source models like CodeLlama using Llama-recipes.
  • Includes scripts to reproduce benchmark results for OpenAI's GPT-3.5 and GPT-4.
  • Offers Docker images for streamlined environment setup for both ML-LLM-Bench and ML-Agent-Bench.
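
To make the three context settings concrete, here is a hedged sketch of how the context handed to the model might differ per setting; the setting names, fields, and the stub retriever are illustrative, not the repository's exact configuration:

    # Illustrative only: how prompt context could vary across the three
    # evaluation settings named above.
    def retrieve_relevant_segments(example, k=3):
        # Stub retriever; a real one would rank README chunks with BM25
        # or embeddings. "readme_segments" is a hypothetical field name.
        return "\n".join(example.get("readme_segments", [])[:k])

    def build_context(example, setting):
        if setting == "oracle":
            # Gold segment known to contain the needed usage information.
            return example["oracle_segment"]
        if setting == "retrieval":
            # Automatically retrieved README segments.
            return retrieve_relevant_segments(example)
        return ""  # description-only: no repository context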

Maintenance & Community

  • The project is maintained under the gersteinlab GitHub organization.
  • Links to relevant research papers and setup guides are provided within the README.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Data preparation for long READMEs (up to 128k tokens) can take roughly 2 hours when run sequentially; parallelizing across 10 workers reduces this to about 10 minutes (see the sketch after this list).
  • Fine-tuning and inference scripts require specific parameter adjustments for model paths, task types, and dataset locations.
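
The 10-worker figure above corresponds to the usual process-pool pattern; a minimal sketch follows, where preprocess_readme stands in for whatever per-file work scripts/post_process/prepare.sh actually does:

    # Hedged sketch of parallel README preprocessing with 10 workers;
    # preprocess_readme() is a placeholder, not the repo's real pipeline.
    from multiprocessing import Pool
    from pathlib import Path

    def preprocess_readme(path):
        text = Path(path).read_text(errors="ignore")
        return str(path), len(text)  # stand-in for tokenization/chunking

    if __name__ == "__main__":
        readmes = sorted(Path("repos").glob("**/README*"))
        with Pool(processes=10) as pool:  # matches the 10-worker setup
            results = pool.map(preprocess_readme, readmes)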
Health Check

  • Last commit: 2 days ago.
  • Responsiveness: inactive (0 pull requests and 0 issues opened in the last 30 days).
  • Star history: 3 stars in the last 90 days.
