Open-source sandbox for LLM evaluation via task-based agent simulations
AgentSims provides an open-source sandbox for evaluating Large Language Models (LLMs) through task-based simulations. It addresses limitations in existing evaluation methods by offering customizable task building, robust benchmarks, and objective metrics, making it suitable for researchers across disciplines seeking to test specific LLM capacities.
How It Works
AgentSims utilizes a simulated environment where LLM agents perform tasks. This approach allows for more comprehensive and objective evaluation compared to traditional benchmarks. The system is designed for high customization, enabling users to build their own evaluation tasks via an interactive GUI or by coding new support mechanisms like memory and planning systems.
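For instance, a new memory mechanism can be thought of as a small component the simulation queries each turn. The sketch below is purely illustrative; the class and method names are hypothetical and do not reflect AgentSims' actual plugin interface.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordMemory:
    """Hypothetical memory mechanism: stores observations and retrieves
    the ones sharing the most words with the current query."""
    observations: list[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        self.observations.append(observation)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.observations,
            key=lambda obs: len(query_words & set(obs.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


# Usage: an agent would store what it observes and recall relevant items
# before planning its next action.
memory = KeywordMemory()
memory.store("The customer ordered a latte at 9am.")
memory.store("The espresso machine is out of order.")
print(memory.retrieve("make a latte"))
```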
Quick Start & Requirements
Install the dependencies with pip install tornado mysql-connector-python websockets openai_async, or run pip install -r requirements.txt. Then populate config/api_key.json with your LLM API keys, create the snapshot and logs directories, and initialize the MySQL database.
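As a rough pre-flight sketch of those steps (the api_key.json schema, database host, user, and password below are placeholders, not values documented here):

```python
import json
import os

import mysql.connector  # from mysql-connector-python, a listed dependency

# Create the working directories mentioned in the quick-start notes.
for directory in ("snapshot", "logs"):
    os.makedirs(directory, exist_ok=True)

# Write a placeholder config/api_key.json; the real schema is defined by the
# repository, so treat these keys as illustrative only.
os.makedirs("config", exist_ok=True)
config_path = os.path.join("config", "api_key.json")
if not os.path.exists(config_path):
    with open(config_path, "w") as f:
        json.dump({"openai": "sk-..."}, f, indent=2)

# Sanity-check that the MySQL server is reachable before initializing the
# database (connection parameters are placeholders).
conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
conn.close()
print("Directories, config stub, and MySQL connectivity look OK.")
```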
Highlighted Details
Maintenance & Community
@misc{lin2023agentsims,
  title={AgentSims: An Open-Source Sandbox for Large Language Model Evaluation},
  author={Jiaju Lin and Haoran Zhao and Aochi Zhang and Yiting Wu and Huqiuyue Ping and Qin Chen},
  year={2023},
  eprint={2308.04026},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
Licensing & Compatibility
Limitations & Caveats
The project recommends deploying on macOS or Linux for stability. The README does not specify a license, which may affect commercial or closed-source use. Initial setup involves several manual steps, including database configuration and API key management.