ML experimentation benchmark for evaluating language agents
MLAgentBench provides a framework for evaluating AI agents on end-to-end machine learning experimentation. It targets AI researchers and developers seeking to benchmark autonomous ML workflows, offering a standardized environment for agents to tackle diverse ML tasks from dataset preparation to model optimization.
How It Works
MLAgentBench simulates real-world ML research environments, presenting agents with datasets and task descriptions. Agents interact by reading files, executing experiments on a compute cluster, and analyzing results. This approach allows for direct comparison of agent capabilities across 13 distinct ML engineering tasks, mimicking the iterative process human researchers follow.
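The agent-facing API is not shown in this summary, so the following is only a minimal Python sketch of the observe/act/analyze loop described above. The Environment class and its read_file and execute_script methods are illustrative assumptions, not the actual MLAgentBench interface.

    # Hypothetical sketch of the observe -> act -> analyze loop.
    # Environment and its methods are stand-ins, NOT the real MLAgentBench API.
    from dataclasses import dataclass, field

    @dataclass
    class Environment:
        """Toy stand-in for a benchmark task environment."""
        files: dict = field(default_factory=lambda: {"train.py": "print('baseline run')"})
        log: list = field(default_factory=list)

        def read_file(self, name: str) -> str:
            # Observe: read a task file provided with the benchmark.
            return self.files.get(name, "")

        def execute_script(self, name: str) -> str:
            # Act: a real environment would run this on a compute cluster;
            # here we only record that an experiment was launched.
            self.log.append(f"executed {name}")
            return "dummy result: accuracy=0.50"

    def run_agent(env: Environment, max_steps: int = 3) -> list:
        """Iterate observe -> act -> analyze, mimicking a human research loop."""
        observations = []
        for step in range(max_steps):
            source = env.read_file("train.py")            # read task files
            result = env.execute_script("train.py")       # run an experiment
            observations.append((step, len(source), result))  # analyze results
        return observations

    if __name__ == "__main__":
        print(run_agent(Environment()))

In the actual benchmark, the toy environment above is replaced by real datasets, task descriptions, and experiment execution on a compute cluster, as described in the paragraph preceding this sketch.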
Quick Start & Requirements
Install from source with pip install -e ., or use the provided Docker image (qhwang123/researchassistant:latest). Kaggle-backed tasks require credentials in .kaggle/kaggle.json and setting KAGGLE_CONFIG_DIR. LLM API keys need to be in openai_api_key.txt, claude_api_key.txt, or crfm_api_key.txt.

Example run:

python -u -m MLAgentBench.runner --python $(which python) --task cifar10 --device 0 --log-dir first_test --work-dir workspace --llm-name gpt-4 --edit-script-llm-name gpt-4 --fast-llm-name gpt-3.5-turbo
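The setup and launch steps above can also be wired together in a small launcher script. The sketch below only reuses the documented command, flags, and file names; the ~/.kaggle location and the assumption that the key files live in the current working directory are guesses, not documented behavior.

    # Hedged launcher sketch: sets up environment variables and runs the
    # documented MLAgentBench command. File locations are assumptions.
    import os
    import shutil
    import subprocess
    import sys

    # Point KAGGLE_CONFIG_DIR at the directory holding kaggle.json (assumed to be ~/.kaggle).
    os.environ.setdefault("KAGGLE_CONFIG_DIR", os.path.expanduser("~/.kaggle"))

    # The README says LLM keys go in one of these files; we assume the current directory.
    for key_file in ("openai_api_key.txt", "claude_api_key.txt", "crfm_api_key.txt"):
        if os.path.exists(key_file):
            break
    else:
        sys.exit("No LLM API key file found; create e.g. openai_api_key.txt first.")

    # Documented example invocation for the cifar10 task.
    cmd = [
        sys.executable, "-u", "-m", "MLAgentBench.runner",
        "--python", shutil.which("python") or sys.executable,
        "--task", "cifar10",
        "--device", "0",
        "--log-dir", "first_test",
        "--work-dir", "workspace",
        "--llm-name", "gpt-4",
        "--edit-script-llm-name", "gpt-4",
        "--fast-llm-name", "gpt-3.5-turbo",
    ]
    subprocess.run(cmd, check=True)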
Highlighted Details
Maintenance & Community
The project is associated with Stanford University. The repository was last updated about a year ago and is currently marked inactive; further community engagement details are not provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The interactive mode is noted as "under construction." Some workflow scripts (run_experiments.sh, baseline.sh, eval.sh) may require manual path and name adjustments (marked with TODO).