MLAgentBench by snap-stanford

ML experimentation benchmark for evaluating language agents

Created 2 years ago
330 stars

Top 83.2% on SourcePulse

View on GitHub
Project Summary

MLAgentBench provides a framework for evaluating AI agents on end-to-end machine learning experimentation. It targets AI researchers and developers seeking to benchmark autonomous ML workflows, offering a standardized environment for agents to tackle diverse ML tasks from dataset preparation to model optimization.

How It Works

MLAgentBench simulates real-world ML research environments, presenting agents with datasets and task descriptions. Agents interact by reading files, executing experiments on a compute cluster, and analyzing results. This approach allows for direct comparison of agent capabilities across 13 distinct ML engineering tasks, mimicking the iterative process human researchers follow.
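
To make that loop concrete, here is a minimal sketch of the observe-act cycle such an environment implies; the names below (Action, run_episode, env.execute, and so on) are illustrative assumptions, not MLAgentBench's actual API.

    from dataclasses import dataclass

    # Hypothetical shape of one evaluation episode; all names are
    # illustrative assumptions, not MLAgentBench's real interface.

    @dataclass
    class Action:
        name: str        # e.g. "read_file", "execute_script", "final_answer"
        argument: str    # e.g. a file path or a script to run

    def run_episode(env, agent, max_steps=50):
        """Drive one agent through one task until it stops or runs out of steps."""
        observation = env.task_description()
        for _ in range(max_steps):
            # The agent chooses its next action from what it has observed:
            # read a file, edit or execute a script, or submit a final answer.
            action = agent.decide(observation)
            if action.name == "final_answer":
                break
            # The environment applies the action and returns the result,
            # e.g. file contents or the stdout of a training run.
            observation = env.execute(action)
        # Score the final workspace, e.g. test accuracy vs. the baseline.
        return env.evaluate()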

Quick Start & Requirements

  • Installation: pip install -e . or use the provided Docker image (qhwang123/researchassistant:latest).
  • Prerequisites: Python 3.10, Docker (recommended for sandboxing), Kaggle API setup for Kaggle datasets, and an API key for at least one supported LLM provider (OpenAI, Claude, CRFM, Gemini Pro, or Hugging Face).
  • Setup: Kaggle credentials go in .kaggle/kaggle.json, with KAGGLE_CONFIG_DIR pointing at that directory; LLM API keys go in openai_api_key.txt, claude_api_key.txt, or crfm_api_key.txt (see the sketch after this list).
  • Example Run: python -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test --work-dir workspace --llm-name gpt-4 --edit-script-llm-name gpt-4 --fast-llm-name gpt-3.5-turbo
  • Paper: https://arxiv.org/abs/2310.03302
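
A minimal sketch of that credential layout, assuming the file names above and a repo-root working directory (placeholders, not real keys):

    import os
    from pathlib import Path

    # Sketch of the credential layout described above; the file names come
    # from the README, and the placeholders must be replaced with real keys.

    repo_root = Path(".")

    # LLM keys: one plain-text file per provider.
    (repo_root / "openai_api_key.txt").write_text("sk-...your-key...\n")

    # Kaggle: credentials JSON plus an env var pointing at its directory.
    kaggle_dir = repo_root / ".kaggle"
    kaggle_dir.mkdir(exist_ok=True)
    (kaggle_dir / "kaggle.json").write_text(
        '{"username": "your-user", "key": "your-kaggle-key"}\n'
    )
    os.environ["KAGGLE_CONFIG_DIR"] = str(kaggle_dir.resolve())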

Highlighted Details

  • Covers 13 distinct ML engineering tasks, spanning the experimentation pipeline from dataset preparation to model optimization.
  • Agents are scored against baselines using metrics such as Success Rate and Average Improvement (sketched after this list).
  • Includes support for multiple agent frameworks (e.g., Langchain, AutoGPT) and LLMs (OpenAI, Claude, Gemini, Hugging Face).
  • Provides detailed logging and evaluation tools for systematic analysis and reproduction of results.
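
As a rough illustration of how such metrics can be computed over repeated runs of a task (the exact definitions in MLAgentBench may differ; the 10% success threshold and sample scores here are assumptions):

    # Illustrative scoring over several runs of one task; the 10% success
    # threshold and these sample scores are assumptions, not MLAgentBench's
    # official definitions.

    def success_rate(final_scores, baseline, min_gain=0.10):
        """Fraction of runs that beat the baseline by at least min_gain."""
        wins = [s for s in final_scores if s >= baseline * (1 + min_gain)]
        return len(wins) / len(final_scores)

    def average_improvement(final_scores, baseline):
        """Mean relative improvement over the baseline across runs."""
        gains = [(s - baseline) / baseline for s in final_scores]
        return sum(gains) / len(gains)

    # Example: eight runs of an agent on one task with baseline score 0.80.
    scores = [0.92, 0.81, 0.79, 0.95, 0.88, 0.80, 0.84, 0.90]
    print(f"Success Rate: {success_rate(scores, 0.80):.2f}")                  # 0.50
    print(f"Average Improvement: {average_improvement(scores, 0.80):+.1%}")   # +7.7%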

Maintenance & Community

The project is associated with Stanford University. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The interactive mode is noted as "under construction." Some workflow scripts (run_experiments.sh, baseline.sh, eval.sh) may require manual path and name adjustments (marked with TODO).

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

NeMo-Agent-Toolkit by NVIDIA

Open-source library for connecting and optimizing teams of AI agents
2k stars · Created 11 months ago · Updated 20 hours ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Vincent Weisser (cofounder of Prime Intellect), and 1 more.

AgentLaboratory by SamuelSchmidgall

Agentic framework for autonomous research workflows
5k stars · Created 1 year ago · Updated 6 months ago