MedAgentBench by stanfordmlgroup

Benchmark medical LLM agents in a realistic virtual EHR

Created 1 year ago

296 stars

Top 89.4% on SourcePulse

Project Summary

MedAgentBench offers a realistic virtual Electronic Health Record (EHR) environment designed to benchmark the performance of Large Language Model (LLM) agents in medical contexts. It targets researchers and developers evaluating LLM capabilities for clinical applications, providing a standardized platform to assess agent accuracy, safety, and efficiency within simulated healthcare workflows.

How It Works

MedAgentBench extends the AgentBench framework with a virtual EHR simulation. A FHIR (Fast Healthcare Interoperability Resources) server manages the patient data and clinical context. LLM agents interact with this simulated EHR to perform tasks such as diagnosis, treatment planning, or patient interaction, so their medical reasoning and decision-making can be evaluated quantitatively.
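As a rough illustration of what talking to a FHIR server involves, the sketch below builds a standard FHIR search URL and unpacks the resources from a searchset Bundle. It assumes the server from the Quick Start is reachable on localhost:8080; the /fhir/ base path and the helper names (build_search_url, resources_from_bundle, fetch_bundle) are hypothetical and are not taken from the MedAgentBench code.

```python
import json
import urllib.request
from urllib.parse import urlencode, urljoin

# Assumption: the FHIR server runs on localhost:8080 (the port mapped in the
# Quick Start). The "/fhir/" base path is hypothetical; adjust it to your server.
FHIR_BASE = "http://localhost:8080/fhir/"


def build_search_url(resource_type, params, base=FHIR_BASE):
    """Build a FHIR search URL such as <base>Patient?family=Smith."""
    url = urljoin(base, resource_type)
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def resources_from_bundle(bundle):
    """Return the resources carried by a FHIR searchset Bundle."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return [e["resource"] for e in bundle.get("entry", []) if "resource" in e]


def fetch_bundle(url, timeout=10.0):
    """GET a search URL and decode the JSON body (needs a running server)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/fhir+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, resources_from_bundle(fetch_bundle(build_search_url("Patient", {"family": "Smith"}))) would list the matching Patient resources. The first two helpers are pure functions and can be exercised without a server.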

Quick Start & Requirements

Environment: Clone the repository, create a Python 3.9 Conda environment, and install the dependencies (conda create -n medagentbench python=3.9, conda activate medagentbench, pip install -r requirements.txt).
FHIR server: Docker is required. Pull the image (docker pull jyxsu6/medagentbench:latest) and start it (docker run -p 8080:8080 medagentbench). The run command refers to the image by the local name medagentbench, so the pulled image has to be tagged with that name first (docker tag jyxsu6/medagentbench:latest medagentbench) or run under its full name.
Agent configuration: Set the API key for the chosen LLM (e.g., OpenAI) in configs/agents/openai-chat.yaml, then verify the setup with python -m src.client.agent_test.
Running tasks: Start the task workers (python -m src.start_task -a), then start the assigner (python -m src.assigner).
Results: Output is saved to outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json.
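Once a run finishes, the results file can be inspected with a few lines of Python. The summary does not describe the schema of overall.json, so this minimal sketch only loads the file and outlines its keys and value types. The helper names (load_results, describe) are hypothetical, and the gpt-4o-mini segment of the path is assumed to vary with the agent being evaluated.

```python
import json
from pathlib import Path

# Path quoted in the Quick Start; the model segment is assumed to change
# with the agent configuration.
RESULTS_PATH = Path(
    "outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json"
)


def load_results(path=RESULTS_PATH):
    """Load the JSON results file written by the assigner."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(value, depth=0, max_depth=2):
    """Outline nested dict keys and value types without assuming a schema."""
    lines = []
    if isinstance(value, dict) and depth < max_depth:
        for key, child in value.items():
            lines.append("  " * depth + f"{key}: {type(child).__name__}")
            lines.extend(describe(child, depth + 1, max_depth))
    return lines


if __name__ == "__main__":
    if RESULTS_PATH.exists():
        print("\n".join(describe(load_results())))
    else:
        print(f"No results found at {RESULTS_PATH}")
```

Outlining the structure first avoids hard-coding metric names that the summary does not confirm.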

Highlighted Details

Realistic EHR Simulation: Provides a virtual environment mimicking real-world Electronic Health Record systems for authentic agent testing.
FHIR Server Integration: Utilizes the FHIR standard for data representation and interoperability within the simulated EHR.
LLM Agent Benchmarking: Specifically designed for evaluating and comparing the performance of various LLM agents on medical tasks.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmaps are provided in the README.

Licensing & Compatibility

The project is explicitly stated to be for "research purpose" and may not be suitable for "large-scale production." No specific open-source license is mentioned, suggesting potential restrictions on commercial use or redistribution.

Limitations & Caveats

The environment is designed for research use and may not be robust or scalable enough for production deployment in real-world healthcare systems.

Health Check

Last Commit

7 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

16 stars in the last 30 days