Medical speech processing datasets and models
This repository provides a suite of tools and datasets for multilingual medical speech processing, aimed at researchers and developers in healthcare AI. It covers three tasks in the medical domain: Automatic Speech Recognition (ASR), Spoken Named Entity Recognition (NER), and Speech Summarization, with the goal of improving communication and data extraction from clinical conversations. A baseline for the spoken-NER task is sketched below.
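For concreteness, here is a minimal sketch of a cascaded spoken-NER baseline (an ASR transcript fed to a text NER tagger), assuming the standard Hugging Face pipeline interface. The model IDs and audio filename are hypothetical placeholders, and the repository's own models may instead be end-to-end; check the per-task READMEs for the actual identifiers and approach.

```python
# Hedged sketch of a cascaded spoken-NER baseline: transcribe speech, then tag
# medical entities in the transcript. Both model IDs are hypothetical
# placeholders; the project's models may be end-to-end rather than cascaded.
from transformers import pipeline

# Stage 1: automatic speech recognition over a clinical audio clip.
asr = pipeline(
    "automatic-speech-recognition",
    model="leduckhai/medical-asr-demo",  # hypothetical checkpoint ID
)

# Stage 2: token classification (NER) over the resulting transcript.
ner = pipeline(
    "token-classification",
    model="leduckhai/medical-ner-demo",  # hypothetical checkpoint ID
    aggregation_strategy="simple",       # merge subword pieces into entity spans
)

transcript = asr("consultation_clip.wav")["text"]  # hypothetical audio file
for entity in ner(transcript):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```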
How It Works
The project builds on transformer-based models, including Attention Encoder-Decoder (AED) architectures and pre-trained encoders such as Wav2Vec 2.0 (w2v2-Viet) and XLSR-53. These models are fine-tuned on extensive, newly released medical speech datasets spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. The approach relies on large-scale pre-training followed by task-specific fine-tuning for robust generalization, with particular attention to real-time processing and to collaborative LLM-human annotation for the summarization datasets.
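As an illustration of such fine-tuned checkpoints in use, here is a minimal CTC inference sketch, assuming the released ASR models follow the standard Hugging Face wav2vec2 interface. The model ID and audio filename are hypothetical placeholders; substitute the identifiers published in the repository's READMEs.

```python
# Minimal sketch: transcribing a clip with a fine-tuned Wav2Vec 2.0 ASR model.
# The checkpoint ID below is a hypothetical placeholder.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "leduckhai/w2v2-viet-medical"  # hypothetical placeholder

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load the clip, downmix to mono, and resample to the 16 kHz rate
# that wav2vec2-style models expect.
waveform, sample_rate = torchaudio.load("consultation_clip.wav")
waveform = waveform.mean(dim=0)
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and blanks into the final transcript.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Fine-tuning uses the same Wav2Vec2ForCTC interface, but dataset-specific preprocessing and training recipes should be taken from the repository itself.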
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is led by Le Duc Khai of the University of Toronto; contact information and GitHub links are provided in the repository. Work from the project has been accepted at major NLP/speech conferences (LREC-COLING 2024, Interspeech 2024, NAACL 2025, ACL 2025). At the time of this summary, the repository was flagged as inactive, with its last update about a month prior.
Licensing & Compatibility
All code, data, and models are publicly available, and licensing is generally permissive for research use. Specific licensing terms for each dataset and model should still be verified in their respective READMEs.
Limitations & Caveats
While extensive, the datasets focus primarily on Vietnamese and English; other languages have varying levels of coverage. The project is research-oriented, so building deployment-ready production systems may require further adaptation.