MedAgentBench (stanfordmlgroup)
Benchmark medical LLM agents in a realistic virtual EHR
Summary
MedAgentBench offers a realistic virtual Electronic Health Record (EHR) environment designed to benchmark the performance of Large Language Model (LLM) agents in medical contexts. It targets researchers and developers evaluating LLM capabilities for clinical applications, providing a standardized platform to assess agent accuracy, safety, and efficiency within simulated healthcare workflows.
How It Works
This project extends the AgentBench framework with a virtual EHR simulation. A FHIR (Fast Healthcare Interoperability Resources) server holds the patient data and clinical context. LLM agents interact with this simulated EHR to perform tasks such as diagnosis, treatment planning, or patient interaction, which allows their medical reasoning and decision-making to be evaluated quantitatively.
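The interaction between an agent and a FHIR server can be illustrated with a minimal Python sketch. It relies only on the standard FHIR REST conventions (a search is a GET on [base]/ResourceType?param=value, and results come back as a Bundle whose entries carry the resources). The base URL and the helper names build_search_url, resources_from_bundle, and search are assumptions made for illustration and are not taken from the MedAgentBench code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder base URL. The path under which the Docker image
# serves FHIR is not stated in this summary.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type, params, base=FHIR_BASE):
    """Build a FHIR search URL such as [base]/Patient?family=Smith."""
    url = f"{base.rstrip('/')}/{resource_type}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def resources_from_bundle(bundle):
    """Return the resources contained in a FHIR search Bundle."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return [e["resource"] for e in bundle.get("entry", []) if "resource" in e]


def search(resource_type, params, base=FHIR_BASE):
    """Run a search against a live server (requires the container to be up)."""
    with urlopen(build_search_url(resource_type, params, base)) as resp:
        return resources_from_bundle(json.load(resp))


if __name__ == "__main__":
    # Offline demonstration with a hand-written Bundle; no server needed.
    sample = {
        "resourceType": "Bundle",
        "type": "searchset",
        "entry": [{"resource": {"resourceType": "Patient", "id": "example"}}],
    }
    print(build_search_url("Patient", {"family": "Smith"}))
    print([r["id"] for r in resources_from_bundle(sample)])
```

Keeping URL construction and Bundle parsing separate from the network call means the logic can be exercised without a running container; only search touches the server.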
Quick Start & Requirements
Setup involves cloning the repository, creating a Python 3.9 Conda environment, and installing the dependencies (conda create -n medagentbench python=3.9, conda activate medagentbench, pip install -r requirements.txt). Docker is required to run the FHIR server (docker pull jyxsu6/medagentbench:latest, docker run -p 8080:8080 medagentbench). Note that the run command refers to the image as medagentbench, so the pulled image must either be tagged under that name locally or be run by its full name, jyxsu6/medagentbench:latest. Users must configure the API key for their chosen LLM (e.g., OpenAI) in configs/agents/openai-chat.yaml and can verify the configuration with python -m src.client.agent_test. To execute tasks, start the task workers (python -m src.start_task -a) and then the assigner (python -m src.assigner). Results are saved to outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json.
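Once a run has finished, the results file can be inspected with a few lines of Python. The sketch below assumes only that overall.json is valid JSON. Its schema is not described here, so the code walks whatever keys it finds and does not look up specific metrics. The helper names load_results and describe are hypothetical.

```python
import json
from pathlib import Path

# Path quoted from the text above; the model segment (gpt-4o-mini) will
# presumably differ when another agent is configured.
DEFAULT_RESULTS = Path(
    "outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json"
)


def load_results(path=DEFAULT_RESULTS):
    """Load the results JSON, failing with a readable message if absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"no results at {path}; has the assigner finished?")
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(results, indent=0):
    """Render nested keys and scalar values without assuming a schema."""
    pad = "  " * indent
    lines = []
    if isinstance(results, dict):
        for key, value in results.items():
            if isinstance(value, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(describe(value, indent + 1))
            else:
                lines.append(f"{pad}{key}: {value}")
    elif isinstance(results, list):
        lines.append(f"{pad}[{len(results)} items]")
    else:
        lines.append(f"{pad}{results}")
    return lines


if __name__ == "__main__":
    try:
        print("\n".join(describe(load_results())))
    except FileNotFoundError as exc:
        print(exc)
```

Run from the repository root so the relative path resolves; if the file is missing, the script prints a message and no traceback.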
Maintenance & Community
No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmaps are provided in the README.
Licensing & Compatibility
The project is explicitly stated to be for "research purpose" and may not be suitable for "large-scale production." No specific open-source license is mentioned, suggesting potential restrictions on commercial use or redistribution.
Limitations & Caveats
The primary limitation is its intended use case: the environment is designed for research and may not be robust or scalable enough for production deployment in real-world healthcare systems.