Medical speech processing datasets and models
This repository provides a suite of tools and datasets for multilingual medical speech processing, aimed at researchers and developers in healthcare AI. It covers three tasks in the medical domain: Automatic Speech Recognition (ASR), Spoken Named Entity Recognition (NER), and Speech Summarization, with the goal of improving communication and data extraction from clinical conversations. A baseline for the spoken-NER task is sketched below.
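For concreteness, here is a minimal sketch of a cascaded spoken-NER baseline (an ASR transcript fed to a text NER tagger), assuming the standard Hugging Face pipeline interface. The model IDs and audio filename are hypothetical placeholders, and the repository's own models may instead be end-to-end; check the per-task READMEs for the actual identifiers and approach.

```python
# Hedged sketch of a cascaded spoken-NER baseline: transcribe speech, then tag
# medical entities in the transcript. Both model IDs are hypothetical
# placeholders; the project's models may be end-to-end rather than cascaded.
from transformers import pipeline

# Stage 1: automatic speech recognition over a clinical audio clip.
asr = pipeline(
    "automatic-speech-recognition",
    model="leduckhai/medical-asr-demo",  # hypothetical checkpoint ID
)

# Stage 2: token classification (NER) over the resulting transcript.
ner = pipeline(
    "token-classification",
    model="leduckhai/medical-ner-demo",  # hypothetical checkpoint ID
    aggregation_strategy="simple",       # merge subword pieces into entity spans
)

transcript = asr("consultation_clip.wav")["text"]  # hypothetical audio file
for entity in ner(transcript):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```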
How It Works
The project builds on transformer-based models, including Attention Encoder-Decoder (AED) architectures and pre-trained encoders such as Wav2Vec 2.0 (w2v2-Viet) and XLSR-53. These models are fine-tuned on extensive, newly released medical speech datasets spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. The approach relies on large-scale pre-training followed by task-specific fine-tuning for robust generalization, with particular attention to real-time processing and to collaborative LLM-human annotation for the summarization datasets.
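As an illustration of such fine-tuned checkpoints in use, here is a minimal CTC inference sketch, assuming the released ASR models follow the standard Hugging Face wav2vec2 interface. The model ID and audio filename are hypothetical placeholders; substitute the identifiers published in the repository's READMEs.

```python
# Minimal sketch: transcribing a clip with a fine-tuned Wav2Vec 2.0 ASR model.
# The checkpoint ID below is a hypothetical placeholder.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "leduckhai/w2v2-viet-medical"  # hypothetical placeholder

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load the clip, downmix to mono, and resample to the 16 kHz rate
# that wav2vec2-style models expect.
waveform, sample_rate = torchaudio.load("consultation_clip.wav")
waveform = waveform.mean(dim=0)
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and blanks into the final transcript.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Fine-tuning uses the same Wav2Vec2ForCTC interface, but dataset-specific preprocessing and training recipes should be taken from the repository itself.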
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is led by Le Duc Khai of the University of Toronto; contact information and GitHub links are provided in the repository. Work from the project has been accepted at major NLP/speech conferences (LREC-COLING 2024, Interspeech 2024, NAACL 2025, ACL 2025). At the time of this summary, the repository was flagged as inactive, with its last update about a month prior.
Licensing & Compatibility
All code, data, and models are publicly available, and licensing is generally permissive for research use. Specific licensing terms for each dataset and model should still be verified in their respective READMEs.
Limitations & Caveats
While extensive, the datasets focus primarily on Vietnamese and English; other languages have varying levels of coverage. The project is research-oriented, so building deployment-ready production systems may require further adaptation.