This repository provides the official code and models for "Towards Building Multilingual Language Model for Medicine," a project focused on creating open-source, multilingual LLMs for the medical domain. It offers a large multilingual medical corpus (MMedC), a medical question-answering benchmark (MMedBench), and several pre-trained and fine-tuned models, including MMed-Llama3.1-70B, which rivals GPT-4 performance across multiple languages.
How It Works
The project constructs a large multilingual medical corpus (MMedC) of 25.5 billion tokens across six languages and uses it for auto-regressive pre-training of general-purpose LLMs. It also introduces MMedBench, a multilingual medical multiple-choice QA benchmark with rationales, to evaluate and track model progress. Models further trained or fine-tuned on these resources show significant gains over existing open-source medical LLMs and are competitive with proprietary models such as GPT-4.
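In concrete terms, auto-regressive pre-training here means continuing next-token prediction on the new corpus. The snippet below is a minimal illustration of that objective only, not the project's distributed training code; the base-model ID and the sample text are placeholder assumptions.

```python
# Minimal illustration of the causal-LM (next-token prediction) objective used
# for continued pre-training. Placeholder base model and toy text; this is NOT
# the project's multi-node setup, which requires 8x A100 80GB GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B"   # assumed base model (gated; access approval may be needed)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A single toy sample standing in for a shard of the multilingual corpus.
batch = tokenizer(["El paciente presenta fiebre y tos persistente."],
                  return_tensors="pt")
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])   # labels = inputs -> shifted next-token loss
outputs.loss.backward()
optimizer.step()
```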
Quick Start & Requirements
- Installation: Code is provided in separate folders for pre-training, fine-tuning, and inference. Key dependencies are PyTorch 1.13 and Transformers 4.37; LoRA fine-tuning additionally requires the PEFT library (see the sketches after this list).
- Hardware: Auto-regressive training on MMedC requires at least 8 A100 80GB GPUs and training runs exceeding a month. Inference and fine-tuning can be adapted to a single machine by removing the Slurm commands, as in the inference sketch below.
- Resources: The project offers models of various sizes (1.8B, 7B, 8B, 70B parameters).
- Links: Paper (arXiv): https://arxiv.org/abs/2402.13963; Leaderboard: https://github.com/MAGIC-AI4Med/MMedLM/blob/main/leaderboard.md
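For single-machine use, a minimal inference script with Transformers looks roughly like the following. The Hugging Face Hub model ID is an assumption; check the repository's model list for the exact identifiers of the released checkpoints.

```python
# Hedged single-GPU inference sketch (Transformers >= 4.37). The Hub ID below
# is an assumption; consult the repository's model zoo for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Henrychur/MMed-Llama-3-8B"      # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 8B model fits on one GPU
    device_map="auto",          # requires the `accelerate` package
)

prompt = "Question: What is the first-line pharmacologic treatment for type 2 diabetes?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the LoRA route, a minimal PEFT setup is sketched below; the rank, target modules, and model ID are illustrative assumptions rather than the project's published fine-tuning recipe.

```python
# Hedged LoRA sketch with PEFT. Hyperparameters and target modules are
# illustrative defaults, not the project's published configuration.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Henrychur/MMed-Llama-3-8B",            # assumed Hub ID
    torch_dtype=torch.bfloat16,
)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable
# ...then fine-tune on the MMedBench trainset with a standard Trainer loop.
```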
Highlighted Details
- MMed-Llama3.1-70B achieves 80.51 on MMedBench, outperforming GPT-4 (74.27) and supporting 8 languages.
- MMedLM 2 (7B) rivals GPT-4 on MMedBench.
- MMed-Llama 3 (8B) outperforms Llama 3 on English benchmarks such as MedQA (65.4 vs. 60.9) and on MMedBench (79.25 vs. 63.86).
- The project releases the data collection pipeline, including filtering and OCR code.
Maintenance & Community
- The associated paper is published in Nature Communications, and the repository has active releases, including recent models such as MMed-Llama3.1-70B.
- Contact: qiupengcheng@pjlab.org.cn.
Licensing & Compatibility
- The repository is released under the Apache 2.0 license.
- The Apache 2.0 license is permissive, so commercial use is generally allowed.
Limitations & Caveats
- Full auto-regressive training on the MMedC corpus is computationally intensive, requiring significant GPU resources and time.
- Open-source models are fine-tuned on the MMedBench trainset before evaluation, whereas proprietary models such as GPT-3.5/4 and Gemini are evaluated zero-shot via API, so the comparison settings differ.