PromptCBLUE is a benchmark dataset and evaluation framework for large language models (LLMs) in the Chinese medical domain. It transforms 16 existing CBLUE tasks into prompt-based generation tasks, aiming to standardize LLM evaluation in medical NLP. The project targets researchers and developers working with LLMs in healthcare, providing a unified platform for assessing model performance on diverse medical NLP challenges.
How It Works
PromptCBLUE reformulates 16 medical NLP tasks from the CBLUE benchmark into a prompt-based generation format. Each task sample is converted into a structure with `input`, `target`, `type`, and `answer_choices` fields, suitable for LLM processing. This approach enables unified evaluation of LLMs across diverse medical NLP tasks, leveraging the prompt-engineering paradigm.
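The field names above come from the benchmark; the concrete task text, labels, and sample below are illustrative assumptions, not actual PromptCBLUE data. A minimal sketch of what one converted sample might look like in JSON-lines form:

```python
import json

# Illustrative PromptCBLUE-style sample. The four field names come from
# the benchmark description; the task content here is made up.
sample = {
    "input": "判断下面两句医疗问句是否表达相同的意思：\n"
             "句子1：小孩发烧可以吃布洛芬吗？\n"
             "句子2：儿童发热能服用布洛芬吗？\n"
             "选项：相同，不同",
    "target": "相同",
    "type": "cls",
    "answer_choices": ["相同", "不同"],
}

# Serialize one sample per line, as is common for JSON-lines datasets.
line = json.dumps(sample, ensure_ascii=False)
print(line)
```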
Quick Start & Requirements
- Dataset Download: Access full datasets via the PromptCBLUE evaluation websites (General Track or Open Source Track). Toy examples are available in `datasets/toy_examples`.
- Submission: Submit a `test_predictions.json` file and a `post_generate_process.py` script (Python standard library only) for evaluation.
- Resources: The project provides baseline code using ChatGLM-6B with p-tuning and LoRA. It also offers pre-trained models like ChatMed-Consult and ChatMed-TCM, fine-tuned on LLaMA.
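The source states that the submitted `post_generate_process.py` script may use only the Python standard library; its exact required interface is not specified here, so the following is a hypothetical sketch of such a post-processing step, mapping a raw model generation to a clean answer (the function name, sample structure, and matching heuristic are assumptions):

```python
import json

def post_process(raw_output: str, answer_choices=None) -> str:
    """Hypothetical post-processing: trim whitespace and, when answer
    choices are known, return the first choice mentioned in the output."""
    text = raw_output.strip()
    if answer_choices:
        for choice in answer_choices:
            if choice in text:
                return choice
    return text

if __name__ == "__main__":
    # Illustrative usage: clean one generated answer and emit a record.
    pred = post_process("答案：相同。", ["相同", "不同"])
    record = {"sample_id": "toy-0001", "answer": pred}
    print(json.dumps(record, ensure_ascii=False))
```

Keeping the logic to plain string handling and `json` respects the standard-library-only constraint.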
Highlighted Details
- Offers two tracks: a General Track for any LLM and an Open Source Track requiring open-source base models and datasets.
- Includes supplementary resources: ChatMed_Consult_Dataset (500k+ online consultations with ChatGPT replies) and ChatMed_TCM_Dataset (26k+ TCM instructions).
- Provides baseline implementations using ChatGLM-6B with p-tuning and LoRA, showing competitive performance.
- Supports evaluation of ChatGPT via in-context learning (ICL) as a reference.
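In-context learning here means prepending a few solved examples to the test input so the model can imitate the format without fine-tuning. A minimal sketch of assembling such a few-shot prompt; the template strings and demo content are assumptions, not the exact format PromptCBLUE uses:

```python
def build_icl_prompt(demos, query):
    """Assemble a simple few-shot (in-context learning) prompt by
    prepending solved (input, target) examples to the test input.
    The 输入/输出 template is a plain illustrative choice."""
    parts = []
    for inp, tgt in demos:
        parts.append(f"输入：{inp}\n输出：{tgt}")
    parts.append(f"输入：{query}\n输出：")
    return "\n\n".join(parts)

# Illustrative demo pair and query for a symptom-extraction-style task.
demos = [("患者主诉头痛三天。请抽取症状实体。", "头痛")]
prompt = build_icl_prompt(demos, "患者自述咳嗽伴低热。请抽取症状实体。")
print(prompt)
```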
Maintenance & Community
- Organized by researchers from East China Normal University, Alibaba, Huashan Hospital, Northeastern University, and others.
- Evaluation is hosted on the Tianchi platform.
- Community discussion channels include DingTalk and WeChat groups.
Licensing & Compatibility
- Resources are for academic research use only; commercial use is strictly prohibited.
- The project is based on the CBLUE benchmark.
Limitations & Caveats
- The project explicitly forbids calling public LLM APIs (e.g., GPT-4, ChatGPT) to produce test set predictions, unless the participants are themselves the developers of that model.
- Participants must disclose their training methods and data sources.
- The "Open Source Track" requires adherence to specific data and model licensing for training.