Huatuo-26M by FreedomIntelligence

Large Chinese medical QA dataset with 26M question-answer pairs

created 2 years ago
285 stars

Top 92.8% on sourcepulse

Project Summary

Huatuo-26M is a Chinese medical question-answering dataset of roughly 26 million question-answer pairs, intended for AI research in healthcare. It is aimed at researchers and developers building medical NLP applications, training machine learning models for medical tasks, or developing intelligent medical systems.

How It Works

The dataset aggregates question-answer pairs from several sources, including online medical encyclopedias, knowledge bases, and consultation records. A refined version, Huatuo-Lite, improves data quality and adds fields such as hospital department and related diseases. Drawing on multiple sources gives broad coverage of medical topics, from diseases and symptoms to treatments and drug information.
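
As a rough illustration of the two record shapes described above, the following Python sketch models a base question-answer pair and the extra Huatuo-Lite metadata. The class name `QAPair`, the field names (`source`, `department`, `related_diseases`), and the source labels are hypothetical and chosen only for illustration; the actual column names in the released files may differ, and the sample values are placeholders rather than real dataset rows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QAPair:
    """One question-answer pair (illustrative schema, not the official one)."""
    question: str
    answer: str
    # Assumed labels for where the pair came from, e.g. "encyclopedia",
    # "knowledge_base", or "consultation".
    source: str
    # Fields of the kind Huatuo-Lite is described as adding; names are assumptions.
    department: Optional[str] = None
    related_diseases: List[str] = field(default_factory=list)


def has_lite_fields(pair: QAPair) -> bool:
    """Return True when the record carries the additional Lite-style metadata."""
    return pair.department is not None or bool(pair.related_diseases)


# Placeholder records, not taken from the dataset.
base = QAPair(
    question="placeholder question",
    answer="placeholder answer",
    source="encyclopedia",
)
lite = QAPair(
    question="placeholder question",
    answer="placeholder answer",
    source="consultation",
    department="placeholder department",
    related_diseases=["placeholder disease"],
)
```

Keeping the Lite fields optional lets the same type represent records from either version, which may be convenient when mixing the two in one pipeline.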

Quick Start & Requirements

Highlighted Details

  • Over 26 million high-quality Chinese medical Q&A pairs.
  • Covers diseases, symptoms, treatments, and drug information.
  • Huatuo-Lite offers refined data with additional fields (department, related diseases).
  • Usable for Q&A systems, text classification, sentiment analysis, disease prediction, and LLM fine-tuning.
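
For the LLM fine-tuning use case listed above, question-answer pairs are commonly converted into prompt/response records before training. The sketch below shows one such conversion to JSON Lines. The input keys (`question`, `answer`) and output keys (`instruction`, `output`) are assumptions based on a widespread convention, not a format specified by the project, and the sample pairs are placeholders.

```python
import json
from typing import Dict, Iterable, Iterator


def to_instruction_records(pairs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield one JSON Lines string per usable question-answer pair."""
    for pair in pairs:
        question = pair.get("question", "").strip()
        answer = pair.get("answer", "").strip()
        # Skip pairs where either side is missing or blank.
        if not question or not answer:
            continue
        record = {"instruction": question, "output": answer}
        # ensure_ascii=False keeps Chinese text readable in the output file.
        yield json.dumps(record, ensure_ascii=False)


# Placeholder pairs, not taken from the dataset.
sample_pairs = [
    {"question": "placeholder question 1", "answer": "placeholder answer 1"},
    {"question": "placeholder question 2", "answer": "   "},
]
lines = list(to_instruction_records(sample_pairs))
```

Because the function is a generator, it can be streamed over a large file without holding all 26 million pairs in memory.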

Maintenance & Community

  • Contact: xidongw@163.com or via GitHub Issues.
  • Citation available for academic use.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permits commercial use and integration with closed-source projects.

Limitations & Caveats

The huatuo_consultation_qa subset mostly provides URLs in place of answer text, so the answers must be fetched and processed before they are usable. The dataset as a whole is focused primarily on Chinese medical information.
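
A first processing step for URL-style answers could be to separate them from answers that already contain text. The sketch below does this with the standard library only. The record layout (a dict with an `answer` key) and the function names are assumptions for illustration, and the sample records are placeholders; fetching and parsing the linked pages is left out.

```python
from urllib.parse import urlparse


def is_url_only(answer: str) -> bool:
    """Return True when the answer is a single http(s) URL and nothing else."""
    text = answer.strip()
    # A lone URL contains no internal whitespace.
    if not text or any(ch.isspace() for ch in text):
        return False
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def split_by_answer_type(records):
    """Split records into (text_records, url_records) based on the answer field."""
    text_records, url_records = [], []
    for record in records:
        target = url_records if is_url_only(record.get("answer", "")) else text_records
        target.append(record)
    return text_records, url_records


# Placeholder records, not taken from the dataset.
sample = [
    {"question": "placeholder question", "answer": "https://example.com/consult/123"},
    {"question": "placeholder question", "answer": "placeholder answer text"},
]
text_records, url_records = split_by_answer_type(sample)
```

Records in `url_records` would still need a retrieval step before they can serve as training or evaluation data.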

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 25 stars in the last 90 days
