Medical MLLM for visual question answering
Top 84.6% on sourcepulse
HuatuoGPT-Vision addresses the challenge of integrating medical visual knowledge into multimodal large language models (MLLMs). It provides pre-trained models and a large-scale dataset, PubMedVision, to enhance MLLMs' capabilities in medical visual question answering (VQA) and other medical imaging tasks, targeting researchers and developers in medical AI.
How It Works
HuatuoGPT-Vision uses a multimodal LLM architecture built on the Qwen2.5-VL framework. Medical visual knowledge is injected by fine-tuning base models on PubMedVision, a 1.3M-sample medical VQA dataset derived from PubMed image-text pairs and reformatted with GPT-4V. This approach aims to significantly improve performance on medical VQA benchmarks compared with general-purpose MLLMs.
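If PubMedVision is published on Hugging Face, a few records can be inspected with the datasets library before any fine-tuning. The sketch below is an illustration only: the repo id, config name, and field layout are assumptions to check against the dataset card, not details taken from the README.
# Sketch only: repo id and config name are assumptions.
from datasets import load_dataset

pubmedvision = load_dataset(
    "FreedomIntelligence/PubMedVision",   # assumed Hugging Face repo id
    name="PubMedVision_Alignment_VQA",    # assumed config; verify on the dataset card
    split="train",
    streaming=True,                       # avoid downloading all 1.3M samples up front
)

for i, sample in enumerate(pubmedvision):
    print(sample.keys())                  # inspect the schema before training
    if i >= 2:
        break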
Quick Start & Requirements
Two pre-trained models are released: HuatuoGPT-Vision-7B (Qwen2.5-7B backbone) and HuatuoGPT-Vision-34B (Yi-1.5-34B backbone). Launch the command-line demo with:
python cli.py --model_dir path-to-huatuogpt-vision-model
Or via Python:
from cli import HuatuoChatbot
bot = HuatuoChatbot('path-to-huatuogpt-vision-model')  # path to the downloaded HuatuoGPT-Vision checkpoint
output = bot.inference('What does the picture show?', ['image_path1'])  # question plus a list of image paths
print(output)
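The same inference call extends to simple batch jobs. The sketch below reuses only the HuatuoChatbot API shown above; the scans/ directory and file pattern are hypothetical.
import glob
from cli import HuatuoChatbot

bot = HuatuoChatbot('path-to-huatuogpt-vision-model')

# Ask the same question about every image in a hypothetical scans/ folder.
for image_path in sorted(glob.glob('scans/*.png')):
    answer = bot.inference('What does the picture show?', [image_path])
    print(image_path, '->', answer)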
Inference relies on the transformers library. Training involves vision alignment and vision instruction fine-tuning steps. Evaluation uses the Medical_Multimodal_Evaluation_Data benchmark:
accelerate launch eval.py --data_path Medical_Multimodal_Evaluation_Data/medical_multimodel_evaluation_data.json --model_path HuatuoGPT-Vision-7B
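Before launching a full evaluation run, the benchmark file can be sanity-checked with plain Python. This sketch assumes the file is a JSON list of sample records; the exact schema is not documented in the README.
import json

# Path matches the accelerate command above.
with open('Medical_Multimodal_Evaluation_Data/medical_multimodel_evaluation_data.json') as f:
    samples = json.load(f)

print(f'{len(samples)} evaluation samples')
print('fields of first sample:', sorted(samples[0].keys()))  # assumes a list of dicts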
Key dependencies are accelerate, deepspeed (for training), and transformers. Specific hardware requirements (e.g., GPUs) are implied for training and inference.
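As a quick environment check before training or inference, the following sketch (an addition, not part of the repo) verifies that the listed packages are importable and that a CUDA GPU is visible; it assumes PyTorch is installed alongside transformers.
import importlib.util
import torch  # assumed to be installed alongside transformers

# Core packages named above; deepspeed is only needed for training.
for pkg in ("transformers", "accelerate", "deepspeed"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")

print("CUDA GPU available:", torch.cuda.is_available())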
Maintenance & Community
The project is actively updated, with recent releases including training code, evaluation code, and the models themselves. Links to Hugging Face model repositories are provided.
Licensing & Compatibility
The README does not explicitly state the license for the models or the dataset. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as a research contribution; production readiness, long-term support, and explicit licensing for commercial use are not addressed in the README. Training requires significant computational resources and familiarity with distributed training frameworks such as DeepSpeed.