AgentSims by py499372727

Open-source sandbox for LLM evaluation via task-based agent simulations

created 2 years ago
884 stars

Top 41.6% on sourcepulse

Project Summary

AgentSims provides an open-source sandbox for evaluating Large Language Models (LLMs) through task-based simulations. It addresses limitations in existing evaluation methods by offering customizable task building, robust benchmarks, and objective metrics, making it suitable for researchers across disciplines seeking to test specific LLM capacities.

How It Works

AgentSims runs LLM agents inside a simulated environment where they perform tasks. This approach allows a more comprehensive and objective evaluation than traditional benchmarks. The system is highly customizable: users can build their own evaluation tasks through an interactive GUI, or write code to add new support mechanisms such as memory and planning systems.
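
To make the idea of swappable support mechanisms concrete, here is a minimal sketch of an agent whose memory and planning components can be exchanged independently. None of these names (Memory, Planner, ListMemory, EchoPlanner, Agent) come from AgentSims; they are hypothetical, and the planner is a stub standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Memory(Protocol):
    """Hypothetical interface: anything that can store and recall text."""
    def store(self, observation: str) -> None: ...
    def recall(self, limit: int) -> List[str]: ...


class Planner(Protocol):
    """Hypothetical interface: turns a task plus context into an action."""
    def plan(self, task: str, context: List[str]) -> str: ...


@dataclass
class ListMemory:
    """Keeps observations in order and recalls the most recent ones."""
    items: List[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        self.items.append(observation)

    def recall(self, limit: int) -> List[str]:
        # Guard against limit == 0, since items[-0:] would return everything.
        return self.items[-limit:] if limit > 0 else []


class EchoPlanner:
    """Stub planner; a real one would prompt an LLM with task and context."""
    def plan(self, task: str, context: List[str]) -> str:
        return f"{task} (using {len(context)} memories)"


@dataclass
class Agent:
    memory: Memory
    planner: Planner

    def step(self, task: str, observation: str) -> str:
        # Record what was observed, then plan from the 3 latest memories.
        self.memory.store(observation)
        return self.planner.plan(task, self.memory.recall(limit=3))
```

Because Agent depends only on the two interfaces, a different memory (for example, one backed by a database) or planner can be passed in without touching the agent itself.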

Quick Start & Requirements

  • Install: run pip install tornado mysql-connector-python websockets openai_async, or alternatively pip install -r requirements.txt
  • Prerequisites: Python 3.9.x, MySQL 8.0.31.
  • Setup: Create config/api_key.json with your LLM API keys, create the snapshot/logs directories, and initialize the MySQL database.
  • Docs: Wiki
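
The file and directory part of the setup above can be sketched as a small Python helper. This is a minimal sketch, not the project's own tooling: the function name prepare_workspace is hypothetical, the key names written into api_key.json are placeholders, and reading "snapshot/logs directories" as two sibling directories named snapshot and logs is an assumption. Check the Wiki for the actual key format. MySQL initialization is not covered here.

```python
import json
from pathlib import Path


def prepare_workspace(root: str, api_keys: dict) -> Path:
    """Create the directories and key file the setup step asks for.

    Hypothetical helper, not part of AgentSims.
    """
    base = Path(root)
    # Assumption: "snapshot/logs directories" means two sibling directories.
    for name in ("snapshot", "logs", "config"):
        (base / name).mkdir(parents=True, exist_ok=True)

    key_file = base / "config" / "api_key.json"
    if not key_file.exists():  # never overwrite keys that are already there
        key_file.write_text(json.dumps(api_keys, indent=2), encoding="utf-8")
    return key_file
```

A call such as prepare_workspace(".", {"gpt-4": "YOUR_KEY"}) from the repository root would create the three directories and a placeholder key file; the model name used as the key here is illustrative only.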

Highlighted Details

  • Task-based evaluation framework for LLMs.
  • Interactive GUI for building tasks and creating agents and buildings.
  • Supports custom memory and planning systems for agents.
  • Evaluation via customizable QA forms during simulation.
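
As an illustration of QA-form evaluation, the sketch below scores an agent's answers against expected answers. The QAItem structure, the score_form function, and the exact-match scoring rule are all assumptions made for illustration; the source does not describe how AgentSims defines or scores its forms.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QAItem:
    """One question on a hypothetical evaluation form."""
    question: str
    expected: str


def score_form(items: List[QAItem], answer: Callable[[str], str]) -> float:
    """Return the fraction of questions answered correctly.

    Matching is exact after stripping whitespace and lowercasing.
    An empty form scores 0.0.
    """
    if not items:
        return 0.0
    correct = sum(
        answer(item.question).strip().lower() == item.expected.strip().lower()
        for item in items
    )
    return correct / len(items)
```

Here answer is any callable that maps a question to the agent's reply, so the same form can be posed to different agents or at different points in a simulation.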

Maintenance & Community

  • Developed by PTA Studio, focused on open-source NLP architecture and AI games.
  • Contact: zhaohaoran@buaa.edu.cn
  • Citation:
      @misc{lin2023agentsims,
        title={AgentSims: An Open-Source Sandbox for Large Language Model Evaluation},
        author={Jiaju Lin and Haoran Zhao and Aochi Zhang and Yiting Wu and Huqiuyue Ping and Qin Chen},
        year={2023},
        eprint={2308.04026},
        archivePrefix={arXiv},
        primaryClass={cs.AI}
      }

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project recommends deployment on macOS or Linux for stability. The README does not specify a license, which may affect commercial or closed-source use. Initial setup involves several manual steps, including database configuration and API key management.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 28 stars in the last 90 days
