HealthGPT by DCDmllm

Medical vision-language model for comprehension/generation, per research paper

Created 7 months ago
1,506 stars

Top 27.5% on SourcePulse

Project Summary

HealthGPT is a medical Large Vision-Language Model (LVLM) designed for unified medical visual comprehension and generation. It targets researchers and practitioners in the medical AI domain, offering a single framework to process and generate medical data based on both text and image inputs.

How It Works

HealthGPT employs a heterogeneous low-rank adaptation (H-LoRA) technique and a three-stage learning strategy to adapt pre-trained LLMs for visual tasks. Its architecture features hierarchical visual perception and a task-specific hard router to select relevant visual features and H-LoRA plugins, enabling autoregressive text and image generation.
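The core idea described above can be illustrated with a minimal sketch: a frozen base projection plus task-specific low-rank "plugins", with a hard (discrete) router selecting exactly one plugin per task. All names, shapes, and the routing rule here are illustrative assumptions, not HealthGPT's actual implementation.

```python
# Hypothetical sketch of the H-LoRA idea: a frozen base weight plus
# per-task low-rank adapters (A @ B), chosen by a hard router keyed on
# the task label. Dependency-free; shapes and names are made up.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

class HLoRALinear:
    def __init__(self, base_w, plugins):
        self.base_w = base_w      # frozen base weight, d_in x d_out
        self.plugins = plugins    # task name -> (A: d_in x r, B: r x d_out)

    def forward(self, x, task):
        # Hard router: pick exactly one low-rank plugin by task label.
        a, b = self.plugins[task]
        delta = matmul(matmul(x, a), b)        # low-rank update x @ A @ B
        return add(matmul(x, self.base_w), delta)

# Toy 2x2 example with rank-1 plugins for the two task families.
base = [[1.0, 0.0], [0.0, 1.0]]
plugins = {
    "comprehension": ([[1.0], [0.0]], [[0.5, 0.0]]),
    "generation":    ([[0.0], [1.0]], [[0.0, 0.5]]),
}
layer = HLoRALinear(base, plugins)
x = [[1.0, 1.0]]
print(layer.forward(x, "comprehension"))  # → [[1.5, 1.0]]
```

Because the router is hard rather than soft, only one plugin's weights touch the forward pass per input, which is what lets a single base model serve disjoint comprehension and generation tasks.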

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda create -n HealthGPT python=3.10), activate it (conda activate HealthGPT), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Requires downloading pre-trained weights for the visual encoder (clip-vit-large-patch14-336), base LLMs (Phi-3-mini-4k-instruct or Phi-4), and potentially VQGAN weights for generation. H-LoRA and adapter weights are also needed.
  • Inference: Run inference via provided shell scripts (com_infer.sh, gen_infer.sh) or Python commands, specifying paths to downloaded weights and model configurations. A Gradio UI is available via python app.py.
  • Resources: Specific hardware requirements are not detailed, but LVLMs of this scale typically demand significant GPU memory.
  • Links: HuggingFace (for weights), VL-Health Dataset, Paper.

Highlighted Details

  • Supports 7 medical comprehension and 5 medical generation tasks.
  • Achieved a score of 70.4 with HealthGPT-XL32 (based on Qwen2.5-32B-Instruct), outperforming previous versions.
  • Built upon LLaVA, LLaVA++, and Taming Transformers.

Maintenance & Community

The project is associated with multiple academic institutions and Alibaba. The README indicates ongoing development with planned releases for training scripts and a project website.

Licensing & Compatibility

Licensed under Apache License 2.0, permitting commercial use and closed-source linking.

Limitations & Caveats

Full training scripts and complete H-LoRA weights for generation tasks are not yet released. The project builds on recent large models (e.g., Qwen2.5-32B-Instruct), which carry substantial hardware demands of their own.

Health Check

Last Commit: 4 months ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 2
Star History: 22 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai — MoE vision-language model for multimodal understanding

Top 0.1% on SourcePulse, 5k stars
Created 9 months ago, updated 6 months ago