Hulu-Med by ZJUI-AI4H

Transparent medical vision-language model

Created 3 weeks ago

329 stars

Top 82.9% on SourcePulse

Project Summary

Hulu-Med is a transparent, generalist vision-language model designed for holistic medical understanding. It unifies the processing of diverse modalities, including text, 2D images, 3D volumes, and videos, and targets researchers and practitioners in the medical AI domain. The model is trained entirely on publicly available data and reports state-of-the-art performance on numerous medical benchmarks, promoting accessibility and reproducibility.

How It Works

The architecture pairs a SigLIP-based vision encoder (with 2D RoPE) with a Qwen-based LLM decoder, connected by a multimodal projector. A key innovation is "Medical-Aware Token Reduction," which prunes roughly 55% of vision tokens so that 2D images, 3D volumes, and videos can all be processed efficiently within a single unified model.
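
The token-reduction idea can be illustrated with a minimal sketch: score each vision token, keep only the top fraction, and preserve spatial order. All names here (`reduce_tokens`, the saliency scores) are illustrative assumptions, not the repository's actual API.

```python
def reduce_tokens(tokens, scores, keep_ratio=0.45):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    With keep_ratio=0.45, roughly 55% of tokens are dropped, matching the
    reduction figure quoted in the summary. Hypothetical sketch only.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens...
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # ...restored to their original (spatial) order before returning.
    top.sort()
    return [tokens[i] for i in top]

tokens = list(range(100))          # stand-in for 100 vision-token embeddings
scores = [i % 10 for i in tokens]  # stand-in saliency scores
kept = reduce_tokens(tokens, scores)
print(len(kept))                   # 45 of 100 tokens kept -> ~55% reduction
```

The real system presumably computes scores from learned, medically aware features; the sketch only shows the select-and-reorder mechanics.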

Quick Start & Requirements

  • Primary Install/Run: Clone repository, create Conda environment (Python 3.10), install dependencies via pip (PyTorch for CUDA 11.8, Flash-attn, Transformers, multimedia libs). HuggingFace Transformers integration is recommended for inference.
  • Prerequisites: Python 3.10, CUDA 11.8 (for PyTorch), flash-attn, transformers, decord, ffmpeg-python, imageio, opencv-python, nibabel.
  • Resource Footprint: Training costs range from roughly 4,000 to 40,000 GPU hours for the 7B-32B variants; expect CUDA-capable GPU hardware for inference as well.
  • Links: Paper, HuggingFace Models, ModelScope Models, Demo.
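
The install flow above can be sketched as a short shell session. The repository URL is a hypothetical placeholder (substitute the actual one from the project page), and the exact dependency pins may differ from the repo's requirements file:

```shell
# Hypothetical sketch of the Quick Start steps; URL and pins are assumptions.
git clone https://github.com/ZJU-AI4H/Hulu-Med.git   # placeholder URL
cd Hulu-Med
conda create -n hulu-med python=3.10 -y
conda activate hulu-med
# PyTorch wheels built against CUDA 11.8, per the prerequisites
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn transformers decord ffmpeg-python imageio opencv-python nibabel
```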

Highlighted Details

  • Holistic Multimodal Understanding: Integrates medical text, 2D images, 3D volumes, and surgical videos.
  • Transparency: Offers a fully open-source pipeline, including data curation, training code, and model weights.
  • State-of-the-Art Performance: Claims superior performance over leading open-source models and competitiveness with proprietary systems on 30 medical benchmarks.
  • Efficient Training: Model variants (7B-32B) require 4,000-40,000 GPU hours for training.
  • Comprehensive Data: Trained on 16.7 million samples across 12 anatomical systems and 14 medical imaging modalities.
  • Transformers Native: Supports HuggingFace Transformers for easy integration.

Maintenance & Community

Developed by the ZJU AI4H Team. No specific community channels (e.g., Discord, Slack) are detailed in the README.

Licensing & Compatibility

Released under the Apache 2.0 License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README indicates that detailed instructions for downloading and preparing the training data are "Coming soon," potentially hindering full reproducibility or custom training efforts.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 329 stars in the last 27 days
