PromptCBLUE is a benchmark dataset and evaluation framework for large language models (LLMs) in the Chinese medical domain. It transforms 16 existing CBLUE tasks into prompt-based generation tasks, aiming to standardize LLM evaluation in medical NLP. The project targets researchers and developers working with LLMs in healthcare, providing a unified platform for assessing model performance on diverse medical NLP challenges.
How It Works
PromptCBLUE reformulates 16 medical NLP tasks from the CBLUE benchmark into a prompt-based generation format. Each task sample is converted into a structure with `input`, `target`, `type`, and `answer_choices` fields, suitable for LLM processing. This approach enables unified evaluation of LLMs across diverse medical NLP tasks, leveraging the prompt-engineering paradigm.
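The field names above come from the benchmark; the concrete task text, labels, and sample below are illustrative assumptions, not actual PromptCBLUE data. A minimal sketch of what one converted sample might look like in JSON-lines form:

```python
import json

# Illustrative PromptCBLUE-style sample. The four field names come from
# the benchmark description; the task content here is made up.
sample = {
    "input": "判断下面两句医疗问句是否表达相同的意思：\n"
             "句子1：小孩发烧可以吃布洛芬吗？\n"
             "句子2：儿童发热能服用布洛芬吗？\n"
             "选项：相同，不同",
    "target": "相同",
    "type": "cls",
    "answer_choices": ["相同", "不同"],
}

# Serialize one sample per line, as is common for JSON-lines datasets.
line = json.dumps(sample, ensure_ascii=False)
print(line)
```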
Quick Start & Requirements
- Dataset Download: Access full datasets via the PromptCBLUE evaluation websites (General Track or Open Source Track). Toy examples are available in `datasets/toy_examples`.
- Submission: Submit a `test_predictions.json` file and a `post_generate_process.py` script (Python standard library only) for evaluation.
- Resources: The project provides baseline code using ChatGLM-6B with p-tuning and LoRA. It also offers pre-trained models like ChatMed-Consult and ChatMed-TCM, fine-tuned on LLaMA.
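The source states that the submitted `post_generate_process.py` script may use only the Python standard library; its exact required interface is not specified here, so the following is a hypothetical sketch of such a post-processing step, mapping a raw model generation to a clean answer (the function name, sample structure, and matching heuristic are assumptions):

```python
import json

def post_process(raw_output: str, answer_choices=None) -> str:
    """Hypothetical post-processing: trim whitespace and, when answer
    choices are known, return the first choice mentioned in the output."""
    text = raw_output.strip()
    if answer_choices:
        for choice in answer_choices:
            if choice in text:
                return choice
    return text

if __name__ == "__main__":
    # Illustrative usage: clean one generated answer and emit a record.
    pred = post_process("答案：相同。", ["相同", "不同"])
    record = {"sample_id": "toy-0001", "answer": pred}
    print(json.dumps(record, ensure_ascii=False))
```

Keeping the logic to plain string handling and `json` respects the standard-library-only constraint.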
Highlighted Details
- Offers two tracks: a General Track for any LLM and an Open Source Track requiring open-source base models and datasets.
- Includes supplementary resources: ChatMed_Consult_Dataset (500k+ online consultations with ChatGPT replies) and ChatMed_TCM_Dataset (26k+ TCM instructions).
- Provides baseline implementations using ChatGLM-6B with p-tuning and LoRA, showing competitive performance.
- Supports evaluation of ChatGPT via in-context learning (ICL) as a reference.
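In-context learning here means prepending a few solved examples to the test input so the model can imitate the format without fine-tuning. A minimal sketch of assembling such a few-shot prompt; the template strings and demo content are assumptions, not the exact format PromptCBLUE uses:

```python
def build_icl_prompt(demos, query):
    """Assemble a simple few-shot (in-context learning) prompt by
    prepending solved (input, target) examples to the test input.
    The 输入/输出 template is a plain illustrative choice."""
    parts = []
    for inp, tgt in demos:
        parts.append(f"输入：{inp}\n输出：{tgt}")
    parts.append(f"输入：{query}\n输出：")
    return "\n\n".join(parts)

# Illustrative demo pair and query for a symptom-extraction-style task.
demos = [("患者主诉头痛三天。请抽取症状实体。", "头痛")]
prompt = build_icl_prompt(demos, "患者自述咳嗽伴低热。请抽取症状实体。")
print(prompt)
```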
Maintenance & Community
- Organized by researchers from East China Normal University, Alibaba, Huashan Hospital, Northeastern University, and others.
- Evaluation is hosted on the Tianchi platform.
- Community discussion channels include DingTalk and WeChat groups.
Licensing & Compatibility
- Resources are for academic research use only; commercial use is strictly prohibited.
- The project is based on the CBLUE benchmark.
Limitations & Caveats
- The project explicitly forbids calling public LLM APIs (e.g., GPT-4, ChatGPT) to produce test set predictions, unless the participants are themselves the developers of that model.
- Participants must disclose their training methods and data sources.
- The "Open Source Track" requires adherence to specific data and model licensing for training.