This repository provides instruction-tuned large language models (LLMs) specifically for the Chinese medical domain, named BenTsao (formerly HuaTuo). It aims to improve LLM performance in medical question answering by fine-tuning base models like LLaMA, Bloom, and Huozi with a custom Chinese medical instruction dataset derived from knowledge graphs and literature.
How It Works
The project employs LoRA (Low-Rank Adaptation) for efficient instruction fine-tuning, balancing computational resources and model performance. A key innovation is "knowledge-tuning," which involves a three-stage process: extracting parameters from a question to query a medical knowledge base, retrieving relevant knowledge, and then using this knowledge to generate an answer. This approach aims to make LLMs explicitly utilize structured medical knowledge during inference for more reliable responses.
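The three-stage flow described above can be sketched with a toy pipeline. The knowledge base, the naive entity-matching rule, and the prompt format below are illustrative stand-ins for this sketch, not the project's actual implementation:

```python
# Toy sketch of the three-stage knowledge-tuning flow:
# (1) extract parameters from the question, (2) retrieve matching
# knowledge, (3) build a knowledge-grounded prompt for the LLM.
# TOY_KB is a hypothetical stand-in for a medical knowledge base.

TOY_KB = {
    "肝癌": {"symptom": "abdominal pain, weight loss",
             "treatment": "surgical resection, targeted therapy"},
    "高血压": {"symptom": "headache, dizziness",
               "treatment": "lifestyle changes, antihypertensive drugs"},
}

def extract_parameters(question: str) -> list[str]:
    """Stage 1: pull candidate entities out of the question (naive substring match)."""
    return [entity for entity in TOY_KB if entity in question]

def retrieve_knowledge(entities: list[str]) -> list[str]:
    """Stage 2: look up facts for each extracted entity."""
    facts = []
    for entity in entities:
        for relation, value in TOY_KB[entity].items():
            facts.append(f"{entity} | {relation} | {value}")
    return facts

def build_prompt(question: str) -> str:
    """Stage 3: prepend retrieved facts so the model answers with explicit knowledge."""
    facts = retrieve_knowledge(extract_parameters(question))
    context = "\n".join(facts) if facts else "(no matching knowledge)"
    return f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("肝癌有哪些治疗方法？"))
```

A real implementation would replace the substring match with a trained extractor and query a structured knowledge graph such as CMeKG, but the control flow is the same.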
Quick Start & Requirements
- Install dependencies: `pip install -r requirements.txt`
- Python 3.9+ recommended.
- LoRA weights are available via Baidu Netdisk or Hugging Face.
- Inference scripts are provided for different base models and data sources.
- Example inference command:
  `python infer.py --base_model 'BASE_MODEL_PATH' --lora_weights 'LORA_WEIGHTS_PATH' --use_lora True --instruct_dir 'INFER_DATA_PATH' --prompt_template 'TEMPLATE_PATH'`
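The `--prompt_template` flag points at a template file that is filled in before each query. As a minimal sketch, assuming an Alpaca-style template (the actual templates ship with the repository; the wording below is an illustrative assumption):

```python
# Minimal sketch of applying a prompt template before inference.
# ALPACA_STYLE_TEMPLATE is an assumed, illustrative template; the real
# ones are loaded from the path passed via --prompt_template.

ALPACA_STYLE_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def generate_prompt(instruction: str, template: str = ALPACA_STYLE_TEMPLATE) -> str:
    """Fill the template with the medical instruction to be answered."""
    return template.format(instruction=instruction)

print(generate_prompt("小儿肥胖超重该如何治疗？"))
```

Each line of the file passed via `--instruct_dir` would be formatted this way and fed to the base model plus LoRA weights.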
Highlighted Details
- Offers fine-tuned models based on LLaMA, Bloom, Alpaca-Chinese, and the Huozi (活字) model.
- Dataset construction involves using the GPT-3.5 API with medical knowledge graphs (e.g., CMeKG) and medical literature (e.g., 2023 liver cancer literature).
- Published research papers detailing the methodology and datasets.
- LoRA fine-tuning on an A100-SXM-80GB GPU with batch size 128 uses ~40GB of VRAM; GPUs with 24GB of VRAM (e.g., RTX 3090/4090) are also expected to be sufficient.
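The VRAM figures above follow from LoRA training only the low-rank adapter matrices (A: d_in x r and B: r x d_out) while the base weights stay frozen, so gradients and optimizer states are needed for only a tiny fraction of parameters. A back-of-the-envelope sketch, assuming a LLaMA-7B-like 4096 x 4096 attention projection and rank 8 (illustrative values, not the project's exact configuration):

```python
# Parameter count added by one LoRA adapter pair versus the frozen
# full weight matrix it augments. Dimensions and rank are assumptions
# for illustration.

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA pair: A (d_in x rank) + B (rank x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                           # frozen full weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
# → full: 16,777,216  lora: 65,536  ratio: 0.3906%
```

Under 1% of the per-matrix parameters receive gradients, which is why adapter training fits in far less memory than full fine-tuning.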
Maintenance & Community
- Developed by the Health Intelligence Group at the SCIR Center, Harbin Institute of Technology.
- Key contributors and supervising professors are listed.
- References and acknowledges several open-source projects including Huozi, LLaMA, Stanford Alpaca, and CMeKG.
Licensing & Compatibility
- The project explicitly states that all related resources are restricted to academic research; commercial use is strictly prohibited.
- Use of third-party code is subject to their respective open-source licenses.
Limitations & Caveats
- The project's dataset is largely model-generated and should not be used for actual medical diagnosis.
- The accuracy of model-generated content is not guaranteed due to factors like randomness and quantization.
- The README notes that LLaMA-based models may occasionally produce errors or repetition, owing to the limited Chinese text in LLaMA's pretraining corpus and a still-rough knowledge-integration method; the Huozi-based models are recommended for better performance.