Large Chinese medical QA dataset with 26M question-answer pairs
Huatuo-26M is a Chinese medical question-answering dataset of 26 million question-answer pairs, built for AI research in healthcare. It supports NLP applications, machine learning models for medical tasks, and intelligent medical systems, and is aimed at researchers and developers working on medical AI.
How It Works
The dataset aggregates Q&A pairs from several kinds of sources: online medical encyclopedias, knowledge bases, and consultation records. A refined version, Huatuo-Lite, offers higher data quality and additional fields such as hospital department and related diseases. Drawing on multiple sources gives broad coverage of medical topics, from diseases and symptoms to treatments and drug information.
Quick Start & Requirements
Load each sub-dataset with the Hugging Face datasets library: datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa') (and similar for other sub-datasets).
Requires the datasets library.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The huatuo_consultation_qa dataset primarily contains URLs as answers, so further processing is needed to extract usable answer text. The dataset focuses on Chinese medical information.
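As a rough illustration of the extra processing that URL answers call for, the sketch below separates records whose answer is only a URL from records that carry answer text. It is a minimal example under stated assumptions, not the project's own tooling: the helper names (is_url_answer, split_by_answer_type), the "question" and "answer" column names, and the sample rows are all hypothetical, and the real schema may differ.

```python
import re

# Matches a string that consists of a single http(s) URL and nothing else.
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$", re.IGNORECASE)


def is_url_answer(answer: str) -> bool:
    """Return True when the answer holds only a URL rather than answer text."""
    return bool(URL_ONLY.match(answer))


def split_by_answer_type(records, answer_key="answer"):
    """Split records into (URL-only answers, text answers).

    `answer_key` is an assumed column name; adjust it to the actual schema.
    """
    url_records, text_records = [], []
    for record in records:
        if is_url_answer(str(record[answer_key])):
            url_records.append(record)
        else:
            text_records.append(record)
    return url_records, text_records


# Hypothetical sample rows (real rows are in Chinese and may use other keys).
sample = [
    {"question": "What should I do about a cold?",
     "answer": "https://example.com/consult/123"},
    {"question": "What helps with a headache?",
     "answer": "See a doctor for an examination first."},
]

url_rows, text_rows = split_by_answer_type(sample)
print(len(url_rows), "URL-only answers;", len(text_rows), "text answers")
```

The URL-only records still need a separate step to fetch and parse the linked pages, which this sketch does not attempt. Rows loaded with datasets.load_dataset could be passed to split_by_answer_type once the actual column names are confirmed.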