MLAgentBench by snap-stanford

ML experimentation benchmark for evaluating language agents

Created 2 years ago
306 stars

Top 87.5% on SourcePulse

Project Summary

MLAgentBench provides a framework for evaluating AI agents on end-to-end machine learning experimentation. It targets AI researchers and developers seeking to benchmark autonomous ML workflows, offering a standardized environment for agents to tackle diverse ML tasks from dataset preparation to model optimization.

How It Works

MLAgentBench simulates real-world ML research environments, presenting agents with datasets and task descriptions. Agents interact by reading files, executing experiments on a compute cluster, and analyzing results. This approach allows for direct comparison of agent capabilities across 13 distinct ML engineering tasks, mimicking the iterative process human researchers follow.
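
A toy Python sketch of that read-run-analyze loop is shown below; every name in it is invented for illustration and is not part of the MLAgentBench API.

    # Illustrative only: a toy read -> run -> analyze loop in the spirit of the benchmark.
    # None of these names come from MLAgentBench; they stand in for its environment and LLM calls.
    from typing import Callable

    def toy_agent_loop(
        read_file: Callable[[str], str],          # e.g. returns the task description
        run_experiment: Callable[[str], str],     # executes a training script, returns its log
        propose_edit: Callable[[str, str], str],  # an LLM call mapping (task, last_log) -> next script
        max_steps: int = 5,
    ) -> str:
        """Iterate: read the task, ask the LLM for a script, run it, feed the log back."""
        task = read_file("task_description.txt")
        log = ""
        for _ in range(max_steps):
            script = propose_edit(task, log)      # the "research" step
            log = run_experiment(script)          # the "experiment" step
        return log                                # the final observation used for evaluation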

Quick Start & Requirements

  • Installation: clone the repository and run pip install -e ., or use the provided Docker image (qhwang123/researchassistant:latest).
  • Prerequisites: Python 3.10, Docker (recommended for sandboxing), a Kaggle API setup for Kaggle-hosted datasets, and an API key for at least one supported LLM provider (OpenAI, Claude, CRFM, Gemini Pro, or Hugging Face).
  • Setup: place Kaggle credentials in .kaggle/kaggle.json and point KAGGLE_CONFIG_DIR at that directory; put LLM API keys in openai_api_key.txt, claude_api_key.txt, or crfm_api_key.txt (see the setup sketch after this list).
  • Example Run: python -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test --work-dir workspace --llm-name gpt-4 --edit-script-llm-name gpt-4 --fast-llm-name gpt-3.5-turbo
  • Paper: https://arxiv.org/abs/2310.03302
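
Below is a minimal setup sketch in Python, assuming a Unix-like system, a local clone as the working directory, and an OpenAI key already exported as OPENAI_API_KEY; the key file names and runner flags come from the notes above, while everything else is illustrative.

    # Minimal setup sketch (assumptions: Unix paths, local clone, key in OPENAI_API_KEY).
    import os
    import subprocess
    from pathlib import Path

    # Kaggle credentials: kaggle.json lives in a .kaggle/ directory, and
    # KAGGLE_CONFIG_DIR points at that directory (location assumed here).
    kaggle_dir = Path.home() / ".kaggle"
    os.environ["KAGGLE_CONFIG_DIR"] = str(kaggle_dir)

    # LLM credentials: plain-text key files (repo root assumed as the working directory).
    Path("openai_api_key.txt").write_text(os.environ["OPENAI_API_KEY"])

    # Launch the example CIFAR-10 run from the bullet above.
    subprocess.run(
        [
            "python", "-u", "-m", "MLAgentBench.runner",
            "--python", "python",
            "--task", "cifar10",
            "--device", "0",
            "--log-dir", "first_test",
            "--work-dir", "workspace",
            "--llm-name", "gpt-4",
            "--edit-script-llm-name", "gpt-4",
            "--fast-llm-name", "gpt-3.5-turbo",
        ],
        check=True,
    )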

Highlighted Details

  • Covers 13 distinct ML engineering tasks spanning different aspects of ML experimentation.
  • Agents are evaluated against per-task baselines using metrics such as Success Rate and Average Improvement (an illustrative computation follows this list).
  • Includes support for multiple agent frameworks (e.g., Langchain, AutoGPT) and LLMs (OpenAI, Claude, Gemini, Hugging Face).
  • Provides detailed logging and evaluation tools for systematic analysis and reproduction of results.
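
For intuition, the snippet below shows one illustrative way to compute those two aggregate metrics; the exact definitions are given in the paper linked above, and "success" is simply assumed here to mean beating a per-task baseline score.

    # Illustrative metric computation; "success" here is assumed to mean beating the baseline.
    def success_rate(run_scores: list[float], baseline: float) -> float:
        """Fraction of runs whose final score exceeds the baseline."""
        return sum(score > baseline for score in run_scores) / len(run_scores)

    def average_improvement(run_scores: list[float], baseline: float) -> float:
        """Mean relative improvement of the runs over the baseline."""
        return sum((s - baseline) / baseline for s in run_scores) / len(run_scores)

    # Example: three agent runs on one task against a baseline accuracy of 0.80.
    scores = [0.78, 0.85, 0.91]
    print(success_rate(scores, 0.80))         # 0.666...
    print(average_improvement(scores, 0.80))  # about 0.058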

Maintenance & Community

The project is associated with Stanford University. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The interactive mode is noted as "under construction." Some workflow scripts (run_experiments.sh, baseline.sh, eval.sh) may require manual path and name adjustments (marked with TODO).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
