Hulu-Med by ZJUI-AI4H

Transparent medical vision-language model

Created 3 weeks ago

329 stars

Top 82.9% on SourcePulse

Project Summary

Hulu-Med is a transparent, generalist vision-language model designed for holistic medical understanding. It unifies the processing of diverse modalities, including text, 2D images, 3D volumes, and videos, and targets researchers and practitioners in the medical AI domain. The model is trained entirely on publicly available data and reports state-of-the-art performance on numerous medical benchmarks, promoting accessibility and reproducibility.

How It Works

The architecture pairs a SigLIP-based vision encoder (with 2D RoPE) with a Qwen-based LLM decoder, connected by a multimodal projector. A key innovation is "Medical-Aware Token Reduction," which prunes roughly 55% of vision tokens so that 2D images, 3D volumes, and videos can all be processed efficiently within a single unified model.
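
The token-reduction idea can be illustrated with a minimal sketch: score each vision token, keep only the top fraction, and preserve spatial order. All names here (`reduce_tokens`, the saliency scores) are illustrative assumptions, not the repository's actual API.

```python
def reduce_tokens(tokens, scores, keep_ratio=0.45):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    With keep_ratio=0.45, roughly 55% of tokens are dropped, matching the
    reduction figure quoted in the summary. Hypothetical sketch only.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens...
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # ...restored to their original (spatial) order before returning.
    top.sort()
    return [tokens[i] for i in top]

tokens = list(range(100))          # stand-in for 100 vision-token embeddings
scores = [i % 10 for i in tokens]  # stand-in saliency scores
kept = reduce_tokens(tokens, scores)
print(len(kept))                   # 45 of 100 tokens kept -> ~55% reduction
```

The real system presumably computes scores from learned, medically aware features; the sketch only shows the select-and-reorder mechanics.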

Quick Start & Requirements

  • Primary Install/Run: Clone repository, create Conda environment (Python 3.10), install dependencies via pip (PyTorch for CUDA 11.8, Flash-attn, Transformers, multimedia libs). HuggingFace Transformers integration is recommended for inference.
  • Prerequisites: Python 3.10, CUDA 11.8 (for PyTorch), flash-attn, transformers, decord, ffmpeg-python, imageio, opencv-python, nibabel.
  • Resource Footprint: Training costs range from roughly 4,000 to 40,000 GPU hours for the 7B-32B variants; expect CUDA-capable GPU hardware for inference as well.
  • Links: Paper, HuggingFace Models, ModelScope Models, Demo.
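
The install flow above can be sketched as a short shell session. The repository URL is a hypothetical placeholder (substitute the actual one from the project page), and the exact dependency pins may differ from the repo's requirements file:

```shell
# Hypothetical sketch of the Quick Start steps; URL and pins are assumptions.
git clone https://github.com/ZJU-AI4H/Hulu-Med.git   # placeholder URL
cd Hulu-Med
conda create -n hulu-med python=3.10 -y
conda activate hulu-med
# PyTorch wheels built against CUDA 11.8, per the prerequisites
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn transformers decord ffmpeg-python imageio opencv-python nibabel
```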

Highlighted Details

  • Holistic Multimodal Understanding: Integrates medical text, 2D images, 3D volumes, and surgical videos.
  • Transparency: Offers a fully open-source pipeline, including data curation, training code, and model weights.
  • State-of-the-Art Performance: Claims superior performance over leading open-source models and competitiveness with proprietary systems on 30 medical benchmarks.
  • Efficient Training: Model variants (7B-32B) require 4,000-40,000 GPU hours for training.
  • Comprehensive Data: Trained on 16.7 million samples across 12 anatomical systems and 14 medical imaging modalities.
  • Transformers Native: Supports HuggingFace Transformers for easy integration.

Maintenance & Community

Developed by the ZJU AI4H Team. No specific community channels (e.g., Discord, Slack) are detailed in the README.

Licensing & Compatibility

Released under the Apache 2.0 License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README indicates that detailed instructions for downloading and preparing the training data are "Coming soon," potentially hindering full reproducibility or custom training efforts.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 329 stars in the last 27 days
