MLAgentBench by snap-stanford

ML experimentation benchmark for evaluating language agents

Created 2 years ago
306 stars

Top 87.5% on SourcePulse

Project Summary

MLAgentBench provides a framework for evaluating AI agents on end-to-end machine learning experimentation. It targets AI researchers and developers seeking to benchmark autonomous ML workflows, offering a standardized environment for agents to tackle diverse ML tasks from dataset preparation to model optimization.

How It Works

MLAgentBench simulates real-world ML research environments, presenting agents with datasets and task descriptions. Agents interact by reading files, executing experiments on a compute cluster, and analyzing results. This approach allows for direct comparison of agent capabilities across 13 distinct ML engineering tasks, mimicking the iterative process human researchers follow.
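
A toy Python sketch of that read-run-analyze loop is shown below; every name in it is invented for illustration and is not part of the MLAgentBench API.

    # Illustrative only: a toy read -> run -> analyze loop in the spirit of the benchmark.
    # None of these names come from MLAgentBench; they stand in for its environment and LLM calls.
    from typing import Callable

    def toy_agent_loop(
        read_file: Callable[[str], str],          # e.g. returns the task description
        run_experiment: Callable[[str], str],     # executes a training script, returns its log
        propose_edit: Callable[[str, str], str],  # an LLM call mapping (task, last_log) -> next script
        max_steps: int = 5,
    ) -> str:
        """Iterate: read the task, ask the LLM for a script, run it, feed the log back."""
        task = read_file("task_description.txt")
        log = ""
        for _ in range(max_steps):
            script = propose_edit(task, log)      # the "research" step
            log = run_experiment(script)          # the "experiment" step
        return log                                # the final observation used for evaluation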

Quick Start & Requirements

  • Installation: clone the repository and run pip install -e ., or use the provided Docker image (qhwang123/researchassistant:latest).
  • Prerequisites: Python 3.10, Docker (recommended for sandboxing), a Kaggle API setup for Kaggle-hosted datasets, and an API key for at least one supported LLM provider (OpenAI, Claude, CRFM, Gemini Pro, or Hugging Face).
  • Setup: place Kaggle credentials in .kaggle/kaggle.json and point KAGGLE_CONFIG_DIR at that directory; put LLM API keys in openai_api_key.txt, claude_api_key.txt, or crfm_api_key.txt (see the setup sketch after this list).
  • Example Run: python -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test --work-dir workspace --llm-name gpt-4 --edit-script-llm-name gpt-4 --fast-llm-name gpt-3.5-turbo
  • Paper: https://arxiv.org/abs/2310.03302
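
Below is a minimal setup sketch in Python, assuming a Unix-like system, a local clone as the working directory, and an OpenAI key already exported as OPENAI_API_KEY; the key file names and runner flags come from the notes above, while everything else is illustrative.

    # Minimal setup sketch (assumptions: Unix paths, local clone, key in OPENAI_API_KEY).
    import os
    import subprocess
    from pathlib import Path

    # Kaggle credentials: kaggle.json lives in a .kaggle/ directory, and
    # KAGGLE_CONFIG_DIR points at that directory (location assumed here).
    kaggle_dir = Path.home() / ".kaggle"
    os.environ["KAGGLE_CONFIG_DIR"] = str(kaggle_dir)

    # LLM credentials: plain-text key files (repo root assumed as the working directory).
    Path("openai_api_key.txt").write_text(os.environ["OPENAI_API_KEY"])

    # Launch the example CIFAR-10 run from the bullet above.
    subprocess.run(
        [
            "python", "-u", "-m", "MLAgentBench.runner",
            "--python", "python",
            "--task", "cifar10",
            "--device", "0",
            "--log-dir", "first_test",
            "--work-dir", "workspace",
            "--llm-name", "gpt-4",
            "--edit-script-llm-name", "gpt-4",
            "--fast-llm-name", "gpt-3.5-turbo",
        ],
        check=True,
    )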

Highlighted Details

  • Covers 13 distinct ML engineering tasks spanning different aspects of ML experimentation.
  • Agents are evaluated against per-task baselines using metrics such as Success Rate and Average Improvement (an illustrative computation follows this list).
  • Includes support for multiple agent frameworks (e.g., Langchain, AutoGPT) and LLMs (OpenAI, Claude, Gemini, Hugging Face).
  • Provides detailed logging and evaluation tools for systematic analysis and reproduction of results.
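
For intuition, the snippet below shows one illustrative way to compute those two aggregate metrics; the exact definitions are given in the paper linked above, and "success" is simply assumed here to mean beating a per-task baseline score.

    # Illustrative metric computation; "success" here is assumed to mean beating the baseline.
    def success_rate(run_scores: list[float], baseline: float) -> float:
        """Fraction of runs whose final score exceeds the baseline."""
        return sum(score > baseline for score in run_scores) / len(run_scores)

    def average_improvement(run_scores: list[float], baseline: float) -> float:
        """Mean relative improvement of the runs over the baseline."""
        return sum((s - baseline) / baseline for s in run_scores) / len(run_scores)

    # Example: three agent runs on one task against a baseline accuracy of 0.80.
    scores = [0.78, 0.85, 0.91]
    print(success_rate(scores, 0.80))         # 0.666...
    print(average_improvement(scores, 0.80))  # about 0.058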

Maintenance & Community

The project is associated with Stanford University. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The interactive mode is noted as "under construction." Some workflow scripts (run_experiments.sh, baseline.sh, eval.sh) may require manual path and name adjustments (marked with TODO).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
