Open-source sandbox for LLM evaluation via task-based agent simulations
AgentSims provides an open-source sandbox for evaluating Large Language Models (LLMs) through task-based simulations. It addresses limitations in existing evaluation methods by offering customizable task building, robust benchmarks, and objective metrics, making it suitable for researchers across disciplines seeking to test specific LLM capacities.
How It Works
AgentSims utilizes a simulated environment where LLM agents perform tasks. This approach allows for more comprehensive and objective evaluation compared to traditional benchmarks. The system is designed for high customization, enabling users to build their own evaluation tasks via an interactive GUI or by coding new support mechanisms like memory and planning systems.
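For instance, a new memory mechanism can be thought of as a small component the simulation queries each turn. The sketch below is purely illustrative; the class and method names are hypothetical and do not reflect AgentSims' actual plugin interface.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordMemory:
    """Hypothetical memory mechanism: stores observations and retrieves
    the ones sharing the most words with the current query."""
    observations: list[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        self.observations.append(observation)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.observations,
            key=lambda obs: len(query_words & set(obs.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


# Usage: an agent would store what it observes and recall relevant items
# before planning its next action.
memory = KeywordMemory()
memory.store("The customer ordered a latte at 9am.")
memory.store("The espresso machine is out of order.")
print(memory.retrieve("make a latte"))
```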
Quick Start & Requirements
Install the dependencies with pip install tornado mysql-connector-python websockets openai_async, or run pip install -r requirements.txt. Then populate config/api_key.json with your LLM API keys, create the snapshot and logs directories, and initialize the MySQL database.
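As a rough pre-flight sketch of those steps (the api_key.json schema, database host, user, and password below are placeholders, not values documented here):

```python
import json
import os

import mysql.connector  # from mysql-connector-python, a listed dependency

# Create the working directories mentioned in the quick-start notes.
for directory in ("snapshot", "logs"):
    os.makedirs(directory, exist_ok=True)

# Write a placeholder config/api_key.json; the real schema is defined by the
# repository, so treat these keys as illustrative only.
os.makedirs("config", exist_ok=True)
config_path = os.path.join("config", "api_key.json")
if not os.path.exists(config_path):
    with open(config_path, "w") as f:
        json.dump({"openai": "sk-..."}, f, indent=2)

# Sanity-check that the MySQL server is reachable before initializing the
# database (connection parameters are placeholders).
conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
conn.close()
print("Directories, config stub, and MySQL connectivity look OK.")
```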
Highlighted Details
Maintenance & Community
@misc{lin2023agentsims,
  title={AgentSims: An Open-Source Sandbox for Large Language Model Evaluation},
  author={Jiaju Lin and Haoran Zhao and Aochi Zhang and Yiting Wu and Huqiuyue Ping and Qin Chen},
  year={2023},
  eprint={2308.04026},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
Licensing & Compatibility
Limitations & Caveats
The project recommends deploying on macOS or Linux for stability. The README does not specify a license, which may affect commercial or closed-source use. Initial setup involves several manual steps, including database configuration and API key management.