Medical MLLM for visual question answering
Top 84.6% on sourcepulse
HuatuoGPT-Vision addresses the challenge of integrating medical visual knowledge into multimodal large language models (MLLMs). It provides pre-trained models and a large-scale dataset, PubMedVision, to enhance MLLMs' capabilities in medical visual question answering (VQA) and other medical imaging tasks, targeting researchers and developers in medical AI.
How It Works
HuatuoGPT-Vision uses a multimodal LLM architecture built on the Qwen2.5-VL framework. Medical visual knowledge is injected by fine-tuning base models on PubMedVision, a 1.3M-sample medical VQA dataset derived from PubMed image-text pairs and reformatted with GPT-4V. This approach aims to significantly improve performance on medical VQA benchmarks compared with general-purpose MLLMs.
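If PubMedVision is published on Hugging Face, a few records can be inspected with the datasets library before any fine-tuning. The sketch below is an illustration only: the repo id, config name, and field layout are assumptions to check against the dataset card, not details taken from the README.
# Sketch only: repo id and config name are assumptions.
from datasets import load_dataset

pubmedvision = load_dataset(
    "FreedomIntelligence/PubMedVision",   # assumed Hugging Face repo id
    name="PubMedVision_Alignment_VQA",    # assumed config; verify on the dataset card
    split="train",
    streaming=True,                       # avoid downloading all 1.3M samples up front
)

for i, sample in enumerate(pubmedvision):
    print(sample.keys())                  # inspect the schema before training
    if i >= 2:
        break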
Quick Start & Requirements
Two pre-trained models are released: HuatuoGPT-Vision-7B (Qwen2.5-7B backbone) and HuatuoGPT-Vision-34B (Yi-1.5-34B backbone). Launch the command-line demo with:
python cli.py --model_dir path-to-huatuogpt-vision-model
Or via Python:
from cli import HuatuoChatbot
bot = HuatuoChatbot('path-to-huatuogpt-vision-model')  # path to the downloaded HuatuoGPT-Vision checkpoint
output = bot.inference('What does the picture show?', ['image_path1'])  # question plus a list of image paths
print(output)
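The same inference call extends to simple batch jobs. The sketch below reuses only the HuatuoChatbot API shown above; the scans/ directory and file pattern are hypothetical.
import glob
from cli import HuatuoChatbot

bot = HuatuoChatbot('path-to-huatuogpt-vision-model')

# Ask the same question about every image in a hypothetical scans/ folder.
for image_path in sorted(glob.glob('scans/*.png')):
    answer = bot.inference('What does the picture show?', [image_path])
    print(image_path, '->', answer)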
Inference relies on the transformers library. Training involves vision alignment and vision instruction fine-tuning steps. Evaluation uses the Medical_Multimodal_Evaluation_Data benchmark:
accelerate launch eval.py --data_path Medical_Multimodal_Evaluation_Data/medical_multimodel_evaluation_data.json --model_path HuatuoGPT-Vision-7B
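Before launching a full evaluation run, the benchmark file can be sanity-checked with plain Python. This sketch assumes the file is a JSON list of sample records; the exact schema is not documented in the README.
import json

# Path matches the accelerate command above.
with open('Medical_Multimodal_Evaluation_Data/medical_multimodel_evaluation_data.json') as f:
    samples = json.load(f)

print(f'{len(samples)} evaluation samples')
print('fields of first sample:', sorted(samples[0].keys()))  # assumes a list of dicts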
Key dependencies are accelerate, deepspeed (for training), and transformers. Specific hardware requirements (e.g., GPUs) are implied for training and inference.
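As a quick environment check before training or inference, the following sketch (an addition, not part of the repo) verifies that the listed packages are importable and that a CUDA GPU is visible; it assumes PyTorch is installed alongside transformers.
import importlib.util
import torch  # assumed to be installed alongside transformers

# Core packages named above; deepspeed is only needed for training.
for pkg in ("transformers", "accelerate", "deepspeed"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")

print("CUDA GPU available:", torch.cuda.is_available())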
Maintenance & Community
The project is actively updated, with recent releases including training code, evaluation code, and the models themselves. Links to Hugging Face model repositories are provided.
Licensing & Compatibility
The README does not explicitly state the license for the models or the dataset. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as a research contribution; production readiness, long-term support, and explicit licensing for commercial use are not addressed in the README. Training requires significant computational resources and familiarity with distributed training frameworks such as DeepSpeed.