Multi-modal LLM for 3D medical image analysis
M3D is a comprehensive framework for 3D medical image analysis using multi-modal large language models. It offers a large-scale dataset (M3D-Data), versatile pre-trained models (M3D-LaMed), and an extensive evaluation benchmark (M3D-Bench) covering tasks like retrieval, report generation, VQA, and segmentation. This project targets researchers and developers in medical AI, providing tools to advance diagnostic and analytical capabilities.
How It Works
M3D-LaMed models integrate a pre-trained vision encoder (M3D-CLIP) with large language models (Phi-3-4B, Llama-2-7B). The architecture processes 3D medical images, normalizing and reshaping them into a format compatible with the vision encoder. This encoded visual information is then fused with text prompts, enabling the LLM to perform various downstream tasks. This multi-modal approach allows for a deeper understanding of medical images by leveraging the contextual and generative power of LLMs.
Quick Start & Requirements
Install dependencies with:

pip install -r requirements.txt

Input images must be in .npy format, normalized to 0-1, and shaped as 1x32x256x256. A GPU with CUDA is recommended for performance.
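As a concrete example, the following is a minimal preprocessing sketch under those requirements. The helper name, the use of scipy for resampling, and the random stand-in volume are assumptions for illustration, not the project's official preprocessing script.

# Minimal preprocessing sketch (assumed workflow): min-max normalize a 3D volume
# to 0-1, resample it to 32x256x256, add a channel dimension, and save as .npy.
import numpy as np
from scipy import ndimage

def preprocess_volume(volume: np.ndarray) -> np.ndarray:
    """Normalize a 3D volume to 0-1 and resample it to shape (1, 32, 256, 256)."""
    volume = volume.astype(np.float32)
    # Min-max normalization to the 0-1 range expected by the models.
    volume = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    # Resample depth/height/width to 32x256x256 with linear interpolation.
    target = (32, 256, 256)
    factors = [t / s for t, s in zip(target, volume.shape)]
    volume = ndimage.zoom(volume, factors, order=1)
    # Add the leading channel dimension -> (1, 32, 256, 256).
    return volume[np.newaxis, ...]

raw = np.random.rand(64, 512, 512)  # stand-in for a loaded CT/MRI volume
np.save("example_image.npy", preprocess_volume(raw))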
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates including a new model release (M3D-LaMed-Phi-3-4B) and an online demo. Links to Hugging Face and ModelScope provide access to models and data.
Licensing & Compatibility
The project utilizes publicly available data from Radiopaedia, licensed for non-commercial use in machine learning. Citation is requested for use.
Limitations & Caveats
The segmentation task for the M3D-LaMed-Llama-2-7B model has known issues that are being addressed. While 2D images can in principle be interpolated to the expected 3D input shape, the models are trained primarily on 3D data.