Huatuo-26M by FreedomIntelligence

Large Chinese medical QA dataset with 26M question-answer pairs

Created 3 years ago

336 stars

Top 81.9% on SourcePulse

Project Summary

Huatuo-26M is a massive, 26 million-pair Chinese medical question-and-answer dataset designed for AI research in healthcare. It enables the development of advanced NLP applications, machine learning models for medical tasks, and intelligent medical systems, catering to researchers and developers in the medical AI domain.

How It Works

The dataset aggregates Q&A pairs from diverse sources, including online medical encyclopedias, knowledge bases, and consultation records. A refined version, Huatuo-Lite, offers enhanced data quality and additional fields like hospital departments and related diseases. This multi-source approach ensures broad coverage of medical topics, from diseases and symptoms to treatments and drug information.

Quick Start & Requirements

Install via Hugging Face datasets: datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa') (and similar for other sub-datasets).
Requires Python and the datasets library.
Links: Huatuo-Lite, huatuo_encyclopedia_qa, huatuo_knowledge_graph_qa, huatuo_consultation_qa, huatuo26M-testdatasets.

Highlighted Details

Over 26 million high-quality Chinese medical Q&A pairs.
Covers diseases, symptoms, treatments, and drug information.
Huatuo-Lite offers refined data with additional fields (department, related diseases).
Usable for Q&A systems, text classification, sentiment analysis, disease prediction, and LLM fine-tuning.

Maintenance & Community

Contact: xidongw@163.com or via GitHub Issues.
Citation available for academic use.

Licensing & Compatibility

Licensed under Apache 2.0.
Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The huatuo_consultation_qa dataset primarily contains URLs as answers, requiring further processing to extract actionable information. The dataset is primarily focused on Chinese medical information.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

4 stars in the last 30 days