ML-Bench by gersteinlab

Benchmark for evaluating LLMs/agents on machine learning tasks using repository-level code

Created 1 year ago
302 stars

Top 88.4% on SourcePulse

Project Summary

ML-Bench provides a framework for evaluating Large Language Models (LLMs) and agents on repository-level code tasks, targeting ML researchers and developers. It aims to benchmark LLM performance in understanding and generating code within the context of entire software repositories.

How It Works

ML-Bench utilizes a dataset derived from GitHub repositories, featuring tasks that require LLMs to generate code snippets based on descriptions, retrieved information, or "oracle" code segments. The framework includes scripts for data preparation, running evaluations against LLMs (including OpenAI models and open-source alternatives like CodeLlama), and fine-tuning models. It supports processing READMEs of varying lengths (up to 128k tokens) to simulate real-world code understanding scenarios.
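
The end-to-end flow is, conceptually, the sketch below: give a chat model a (possibly truncated) README plus a task instruction, then collect the generated code or command for later scoring. This is a minimal illustration using the OpenAI Python SDK, not ML-Bench's own evaluation script; the prompt wording, truncation limit, and model name are assumptions.

```python
# Conceptual sketch of a repository-level code-generation query.
# Prompt wording, truncation, and model choice are illustrative assumptions,
# not ML-Bench internals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_code(readme_text: str, instruction: str, model: str = "gpt-4") -> str:
    """Ask a chat model for the code/command that completes a task grounded in a README."""
    prompt = (
        "You are given the README of a machine learning repository.\n\n"
        f"README:\n{readme_text[:100_000]}\n\n"  # very long READMEs may need truncation or retrieval
        f"Task: {instruction}\n"
        "Reply with only the code or shell command needed to complete the task."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```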

Quick Start & Requirements

  • Install: git clone --recurse-submodules https://github.com/gersteinlab/ML-Bench.git followed by pip install -r requirements.txt.
  • Prerequisites: Python, datasets library. Docker is recommended for environment setup.
  • Data: Load via datasets.load_dataset("super-dainiu/ml-bench"). Post-processing scripts (scripts/post_process/prepare.sh) are required for benchmark generation; a loading sketch follows this list.
  • Docker: docker pull public.ecr.aws/i5g0m1f6/ml-bench and docker run -it -v ML_Bench:/deep_data public.ecr.aws/i5g0m1f6/ml-bench /bin/bash.
  • Model Weights: bash utils/download_model_weight_pics.sh (approx. 2 hours).
  • Links: Dataset, OpenAI Script, Fine-tuning, ML-Agent-Bench.
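
As a quick sanity check after installation, the dataset can be loaded straight from the Hugging Face Hub. The sketch below only assumes the dataset ID shown above; it makes no assumptions about split names or record fields.

```python
# Minimal sketch: load the ML-Bench dataset from the Hugging Face Hub
# and inspect its structure. Split names and record fields are whatever
# the hosted dataset defines; none are assumed here.
from datasets import load_dataset

# If the dataset defines multiple configurations, a config name may also
# need to be passed as the second argument to load_dataset.
dataset = load_dataset("super-dainiu/ml-bench")

print(dataset)                         # available splits and their sizes
first_split = next(iter(dataset))      # pick an arbitrary split to inspect
print(dataset[first_split][0].keys())  # field names of one benchmark record
```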

Highlighted Details

  • Evaluates LLMs on repository-level code tasks, including generation from descriptions, retrieval, and oracle context.
  • Supports fine-tuning of open-source models like CodeLlama using Llama-recipes.
  • Includes scripts to reproduce OpenAI's GPT-3.5 and GPT-4 performance.
  • Offers Docker images for streamlined environment setup for both ML-LLM-Bench and ML-Agent-Bench.

Maintenance & Community

  • The project is maintained under the gersteinlab GitHub organization.
  • Links to relevant research papers and setup guides are provided within the README.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Data preparation for large READMEs (128k tokens) can take up to 2 hours without parallelization; with 10 parallel workers it finishes in roughly 10 minutes (see the sketch after this list).
  • Fine-tuning and inference scripts require specific parameter adjustments for model paths, task types, and dataset locations.
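
The parallelization figure above can be reproduced with an ordinary worker pool. The sketch below is a hypothetical illustration: the prepare_readme helper and the directory layout are assumptions, not ML-Bench internals.

```python
# Hypothetical sketch: preprocess many READMEs with a pool of 10 workers.
# prepare_readme and the file layout are illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def prepare_readme(path: Path) -> str:
    # Placeholder for per-README preprocessing (e.g. truncation or
    # segmentation of very long, up-to-128k-token READMEs).
    return path.read_text(errors="ignore")

def prepare_all(readme_dir: str, workers: int = 10) -> list[str]:
    paths = sorted(Path(readme_dir).glob("**/README*"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(prepare_readme, paths))

if __name__ == "__main__":
    prepared = prepare_all("path/to/cloned/repos")  # placeholder directory
    print(f"prepared {len(prepared)} READMEs")
```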
Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 3 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Omar Khattab (Coauthor of DSPy, ColBERT; Professor at MIT), and 5 more.

CodeXGLUE by microsoft

  • Benchmark for code intelligence tasks
  • 0.3% · 2k stars · Created 5 years ago · Updated 1 year ago
  • Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

  • Benchmark for evaluating LLMs on real-world GitHub issues
  • 2.3% · 4k stars · Created 1 year ago · Updated 20 hours ago