ASR-project/Multilingual-PR: Multilingual phoneme recognition via self-supervised speech models
Top 99.6% on SourcePulse
This repository, ASR-project/Multilingual-PR, addresses phoneme recognition across diverse languages by leveraging English-pretrained self-supervised models (Wav2vec2, HuBERT, WavLM) with CTC decoding. It targets speech researchers and engineers facing data scarcity in non-English languages, offering a framework to evaluate how well knowledge transfers from English-pretrained audio models and thereby facilitating multilingual ASR development.
How It Works
The project compares Wav2vec2, HuBERT, and WavLM models pre-trained on English audio. These are adapted to phoneme recognition either by fine-tuning on target-language data or by extracting frozen features for a linear classifier, in both cases trained with a Connectionist Temporal Classification (CTC) objective. This methodology systematically investigates how English-centric acoustic features generalize to phonetically different languages and assesses the trade-offs between fine-tuning and feature extraction.
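To make the CTC step concrete, here is a minimal sketch (not taken from the repository) of CTC greedy decoding, the final stage of the pipeline described above: the network emits a phoneme distribution per audio frame, the per-frame argmax labels are collapsed by merging repeats, and blank tokens are removed. The phoneme symbols below are illustrative.

```python
BLANK = "_"  # CTC blank token (symbol choice is an assumption)

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then drop CTC blanks."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != BLANK:  # merge repeats, skip blanks
            out.append(sym)
        prev = sym
    return out

# Eight frames of per-frame argmax labels -> decoded phoneme sequence
frames = ["_", "h", "h", "_", "ə", "ə", "l", "_"]
print(ctc_greedy_decode(frames))  # ['h', 'ə', 'l']
```

Greedy decoding is the simplest CTC decoder; beam search over the full per-frame distributions usually yields lower error rates at higher cost.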
Quick Start & Requirements
An example notebook walks through training and testing. Dependencies include HuggingFace Transformers, PyTorch-Lightning, and Weights & Biases. The project uses the Mozilla CommonVoice dataset and requires phonemizer to convert transcripts into phoneme ground truth. GPU acceleration is implied for training. Command-line arguments control hyperparameters, dataset selection, and model configuration.
Highlighted Details
With frozen features, WavLM Large achieved the best average test PER (28.31%), outperforming Wav2vec2 Base (44.41%). Fine-tuned, HuBERT Large showed superior performance (17.36% PER), beating WavLM Base (21.59%). Training data volume significantly impacts results; for Swedish, increasing data from ~10 min to ~3 hours improved HuBERT Large's test PER (frozen) from 39.38% to 32.68%. Performance is also analyzed relative to linguistic proximity to English.
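The PER figures above are phoneme error rates: the Levenshtein (edit) distance between the predicted and reference phoneme sequences, divided by the reference length. A self-contained sketch of that metric (the phoneme symbols are illustrative, not from the repository):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    n = len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for ref prefix vs hyp[:j]
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev_diag + cost,   # substitution (or match)
            )
    return dp[n]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = ["h", "ə", "l", "oʊ"]
hyp = ["h", "e", "l"]
print(f"{phoneme_error_rate(ref, hyp):.2%}")  # one substitution + one deletion -> 50.00%
```

Like word error rate, PER can exceed 100% when the hypothesis contains many insertions.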
Maintenance & Community
Authored by Apavou Clément, Belkada Younes, Leo Tronchon, and Arthur Zucker. The repository was last updated roughly three years ago and appears inactive; no community channels or sponsorship information are provided.
Licensing & Compatibility
The repository's license is not explicitly stated in the README, requiring further investigation for adoption.
Limitations & Caveats
Reliance on English-pretrained models may limit performance on linguistically distant languages. Phoneme dictionary availability for target languages is a prerequisite. The project appears research-oriented rather than production-ready.