MedAgentBench by stanfordmlgroup

Benchmark medical LLM agents in a realistic virtual EHR

Created 1 year ago
276 stars

Summary

MedAgentBench offers a realistic virtual Electronic Health Record (EHR) environment designed to benchmark the performance of Large Language Model (LLM) agents in medical contexts. It targets researchers and developers evaluating LLM capabilities for clinical applications, providing a standardized platform to assess agent accuracy, safety, and efficiency within simulated healthcare workflows.

How It Works

The project extends the AgentBench framework with a virtual EHR simulation. A FHIR (Fast Healthcare Interoperability Resources) server holds the patient data and clinical context. LLM agents interact with this simulated EHR to carry out tasks such as diagnosis, treatment planning, or patient interaction, which allows their medical reasoning and decision-making to be evaluated quantitatively.
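To make the agent-to-EHR interaction more concrete, here is a minimal sketch of how a client might issue a FHIR search over REST. Only the `[base]/[ResourceType]?param=value` search pattern comes from the FHIR standard. The base URL `http://localhost:8080/fhir/` and the helper names `build_search_url` and `fhir_search` are assumptions made for illustration and are not taken from the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of a locally running FHIR server; adjust to your setup.
FHIR_BASE = "http://localhost:8080/fhir/"


def build_search_url(resource_type, params, base=FHIR_BASE):
    """Build a FHIR search URL of the form [base]/[type]?key=value&..."""
    query = urlencode(params)
    url = base.rstrip("/") + "/" + resource_type
    return url + "?" + query if query else url


def fhir_search(resource_type, params, base=FHIR_BASE, timeout=10):
    """Send the search as an HTTP GET and return the parsed JSON response."""
    request = Request(
        build_search_url(resource_type, params, base),
        headers={"Accept": "application/fhir+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, a call such as `fhir_search("Patient", {"family": "Smith"})` would return the server's JSON response as a Python dictionary. The search parameters that are useful depend on the data loaded into the server.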

Quick Start & Requirements

Setup and a first run involve the following steps:

  • Clone the repository and install the dependencies in a Python 3.9 Conda environment: conda create -n medagentbench python=3.9, conda activate medagentbench, pip install -r requirements.txt.
  • Install Docker, which is required to run the FHIR server. Pull the image with docker pull jyxsu6/medagentbench:latest, then start it with docker run -p 8080:8080 medagentbench. Because docker run refers to the image by the local name medagentbench, the pulled image must first be tagged under that name (docker tag jyxsu6/medagentbench:latest medagentbench); alternatively, pass jyxsu6/medagentbench:latest to docker run directly.
  • Configure the API key for the chosen LLM (e.g., OpenAI) in configs/agents/openai-chat.yaml.
  • Verify the agent configuration with python -m src.client.agent_test.
  • Start the task workers (python -m src.start_task -a), then the assigner (python -m src.assigner).
  • Results are saved to outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json.
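Once a run has finished, the results file can be inspected with a few lines of Python. The sketch below assumes the output path quoted above; the model and task folder names will differ for other configurations. It makes no assumption about the JSON schema, which is not described here, so it prints the file as-is. The helper name `load_results` is hypothetical.

```python
import json
from pathlib import Path

# Path quoted in the text above; folder names depend on the model and task config.
DEFAULT_RESULTS = Path("outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json")


def load_results(path=DEFAULT_RESULTS):
    """Return the parsed contents of a results file, or None if it is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    results = load_results()
    if results is None:
        print(f"No results found at {DEFAULT_RESULTS}; has the assigner finished?")
    else:
        # The schema is not documented here, so print the file unchanged.
        print(json.dumps(results, indent=2))
```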

Highlighted Details

  • Realistic EHR Simulation: Provides a virtual environment mimicking real-world Electronic Health Record systems for authentic agent testing.
  • FHIR Server Integration: Utilizes the FHIR standard for data representation and interoperability within the simulated EHR.
  • LLM Agent Benchmarking: Specifically designed for evaluating and comparing the performance of various LLM agents on medical tasks.
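As an illustration of the FHIR data representation mentioned above: a FHIR search returns its matches wrapped in a Bundle resource, with each match under `entry[].resource`. The sketch below unpacks such a Bundle. The sample data is hand-written for illustration and is not taken from the benchmark, and `resources_from_bundle` is a hypothetical helper name.

```python
def resources_from_bundle(bundle, resource_type=None):
    """Return resources from a FHIR Bundle, optionally filtered by resourceType."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    resources = [
        entry["resource"] for entry in bundle.get("entry", []) if "resource" in entry
    ]
    if resource_type is not None:
        resources = [r for r in resources if r.get("resourceType") == resource_type]
    return resources


# Hand-written sample data for illustration; not taken from the benchmark.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "total": 1,
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example-1", "birthDate": "1980-01-01"}}
    ],
}

patients = resources_from_bundle(sample_bundle, "Patient")
print([p["id"] for p in patients])  # prints ['example-1']
```

Filtering by `resourceType` is useful because a single Bundle can carry more than one kind of resource, for example when a search asks the server to include related resources.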

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmaps are provided in the README.

Licensing & Compatibility

The project is explicitly stated to be for "research purpose" and may not be suitable for "large-scale production." No specific open-source license is mentioned, suggesting potential restrictions on commercial use or redistribution.

Limitations & Caveats

The primary limitation is its intended use case: the environment is designed for research and may not be robust or scalable enough for production deployment in real-world healthcare systems.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 17 stars in the last 30 days
