AgentClinic by SamuelSchmidgall

Multimodal agent benchmark for AI in simulated clinical diagnosis

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Project Summary

AgentClinic provides a multimodal agent benchmark for evaluating AI performance in simulated clinical environments. It addresses the need for standardized assessment of AI agents in complex medical diagnosis by simulating doctor-patient interactions. This benefits AI researchers and developers by offering a framework to test the diagnostic accuracy, safety, and robustness of AI models in healthcare.

How It Works

The project simulates clinical environments using language and vision agents, enabling multimodal AI evaluation. It employs LLMs to act as doctors, patients, or measurement/moderator agents within these simulated scenarios. This approach allows for the assessment of diagnostic reasoning, the impact of simulated biases on decision-making, and the overall effectiveness of AI in clinical contexts.
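
A minimal sketch of that interaction loop, assuming an OpenAI-style chat API; the prompts, the DIAGNOSIS: stop convention, and the function names below are illustrative, not AgentClinic's actual code:

```python
# Conceptual sketch of a doctor-patient simulation loop; illustrative only,
# not AgentClinic's implementation. Assumes the OpenAI Python SDK (>= 1.0)
# with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def speak(system_prompt: str, transcript: str) -> str:
    """One turn for one agent; its role is fixed by the system prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any chat model works
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript or "Begin the consultation."},
        ],
    )
    return resp.choices[0].message.content

def run_consultation(case: str, max_turns: int = 10) -> str:
    doctor_sys = ("You are a doctor interviewing a patient. Ask one question "
                  "per turn; when confident, reply with 'DIAGNOSIS: <dx>'.")
    patient_sys = (f"You are a patient. Your hidden case notes:\n{case}\n"
                   "Answer the doctor's questions; never state the diagnosis.")
    transcript = ""
    for _ in range(max_turns):
        doctor_line = speak(doctor_sys, transcript)
        transcript += f"Doctor: {doctor_line}\n"
        if "DIAGNOSIS:" in doctor_line:  # doctor committed to an answer
            return doctor_line
        patient_line = speak(patient_sys, transcript)
        transcript += f"Patient: {patient_line}\n"
    # Turn budget exhausted: force a final answer, moderator-style.
    return speak(doctor_sys, transcript + "Moderator: give your final diagnosis now.\n")
```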

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites:
    • API keys for OpenAI or Replicate are required for cloud-based model evaluation.
    • HuggingFace models can be used locally by specifying their path with an HF_ prefix (e.g., HF_mistralai/Mixtral-8x7B-v0.1); see the sketch after this list.
    • The AgentClinic-MIMIC-IV dataset requires approval from PhysioNet.
  • Resource Footprint: evaluation can be "quite slow" depending on the models used and the number of simulated patients.
  • Links: Paper (inferred from citation).
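
The HF_ prefix convention noted under Prerequisites comes from the README; the loader below is a hypothetical sketch of how such a spec string could be routed to a local HuggingFace pipeline rather than a cloud API (the function name and dispatch logic are assumptions, not the repository's code):

```python
# Hypothetical dispatcher for the HF_ model-spec convention; not the
# repository's actual loader. Requires the transformers package.
from transformers import pipeline

def load_generator(model_spec: str):
    """Return a text-generation callable for a model spec string.

    Specs prefixed with 'HF_' (e.g. 'HF_mistralai/Mixtral-8x7B-v0.1') are
    loaded locally via HuggingFace; anything else would be routed to a
    cloud backend such as OpenAI or Replicate.
    """
    if model_spec.startswith("HF_"):
        hf_path = model_spec[len("HF_"):]  # strip the prefix to get the hub path
        gen = pipeline("text-generation", model=hf_path)
        return lambda prompt: gen(prompt, max_new_tokens=256)[0]["generated_text"]
    raise ValueError(f"{model_spec!r} is not a local HF spec; use a cloud backend")
```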

Highlighted Details

  • Supports a wide range of LLMs including OpenAI's GPT series (GPT-4, GPT-4o, GPT-3.5), Anthropic's Claude 3.5 Sonnet, and Meta's Llama 3 70B, as well as HuggingFace models.
  • Includes multimodal capabilities, supporting vision models for evaluating AI in scenarios involving medical images.
  • Features new datasets like AgentClinic-MIMIC-IV (based on real clinical cases, requires PhysioNet approval), expanded MedQA, and NEJM case questions.
  • Allows simulation of various doctor and patient biases (e.g., recency, confirmation, self-diagnosis, socioeconomic) to test AI robustness; a sketch of one way to inject such a bias follows this list.
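
One plausible way to realize such bias presets is to append a bias instruction to an agent's system prompt. The bias texts and helper below are illustrative assumptions, not the repository's implementation:

```python
# Illustrative bias injection; prompt texts and helper name are assumptions,
# not AgentClinic's actual bias presets.
BIAS_PROMPTS = {
    "recency": "You place extra weight on diagnoses you have made recently.",
    "confirmation": "You favor evidence confirming your first hypothesis and "
                    "discount contradicting findings.",
    "self_diagnosis": "You are convinced you already know your diagnosis from "
                      "your own internet research.",
}

def with_bias(base_prompt: str, bias: str | None = None) -> str:
    """Append a named cognitive-bias instruction to an agent's system prompt."""
    if bias is None:
        return base_prompt
    return f"{base_prompt}\n\nBias: {BIAS_PROMPTS[bias]}"

# Example: doctor_sys = with_bias(doctor_sys, "confirmation")
```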

Maintenance & Community

No specific details on maintenance, community channels (like Discord/Slack), or active contributors are provided in the README.

Licensing & Compatibility

The license type is not explicitly stated in the provided README.

Limitations & Caveats

The README notes that running evaluations, particularly with local HuggingFace models, can be "quite slow." The MIMIC-IV dataset also requires a separate approval process from PhysioNet, adding an adoption hurdle for that specific dataset.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 18 stars in the last 30 days
