Multilingual-PR by ASR-project

Multilingual phoneme recognition via self-supervised speech models

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This repository, ASR-project/Multilingual-PR, addresses phoneme recognition across diverse languages by leveraging self-supervised models pretrained on English speech (Wav2vec2, HuBERT, WavLM) together with CTC. It targets speech researchers and engineers facing data scarcity for non-English languages, offering a framework to evaluate how well knowledge transfers from English-pretrained audio models and thereby supporting multilingual ASR development.

How It Works

The project compares Wav2vec2, HuBERT, and WavLM models pre-trained on English audio. Each is used for phoneme recognition either by fine-tuning on target-language data or by extracting frozen features for a linear classifier, in both cases trained with a Connectionist Temporal Classification (CTC) loss. This methodology systematically investigates how well English-centric acoustic features generalize to phonetically different languages and assesses the trade-offs between fine-tuning and feature extraction.
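
As a rough illustration of the frozen-feature path, the sketch below wires a pretrained WavLM encoder (weights frozen) into a linear phoneme classifier trained with a CTC loss, using HuggingFace Transformers and PyTorch. The checkpoint name, vocabulary size, and batching details are illustrative assumptions, not the repository's actual training code.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel


class FrozenWavLMPhonemeCTC(nn.Module):
    """Frozen pretrained encoder + linear phoneme head trained with CTC (sketch)."""

    def __init__(self, num_phonemes: int, checkpoint: str = "microsoft/wavlm-large"):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(checkpoint)
        for p in self.encoder.parameters():  # freeze the pretrained encoder
            p.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_phonemes + 1)  # +1 for the CTC blank
        self.ctc_loss = nn.CTCLoss(blank=num_phonemes, zero_infinity=True)

    def forward(self, waveforms, targets, target_lengths):
        # waveforms: (batch, samples) of 16 kHz audio; targets: padded phoneme ids
        with torch.no_grad():
            frames = self.encoder(waveforms).last_hidden_state  # (batch, frames, hidden)
        log_probs = self.classifier(frames).log_softmax(dim=-1)
        input_lengths = torch.full((waveforms.size(0),), log_probs.size(1), dtype=torch.long)
        # nn.CTCLoss expects (frames, batch, classes)
        return self.ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

Fine-tuning corresponds to leaving the encoder parameters trainable instead of freezing them.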

Quick Start & Requirements

An example notebook guides training and testing. Dependencies include HuggingFace Transformers, PyTorch-Lightning, and Weights & Biases. The project uses the Mozilla CommonVoice dataset and requires the phonemizer library to convert text transcripts into phoneme ground truth. A GPU is assumed for training. Command-line arguments manage hyperparameters, dataset selection, and model configuration.
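
For the phonemizer conversion step, a minimal sketch is shown below; the Swedish language code, separator settings, and example sentence are illustrative assumptions (the espeak backend also requires espeak-ng to be installed on the system), and the repository's exact preprocessing may differ.

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

transcript = "god morgon"  # example CommonVoice-style transcript (Swedish)
phonemes = phonemize(
    transcript,
    language="sv",          # espeak language code for the target language
    backend="espeak",
    separator=Separator(phone=" ", word=" | "),  # space-separated phones, "|" between words
    strip=True,
)
tokens = phonemes.split()   # phone symbols (plus "|" word boundaries) used as CTC targets
print(tokens)
```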

Highlighted Details

With frozen features, WavLM Large achieved the best average test PER (28.31%), outperforming Wav2vec2 Base (44.41%). When fine-tuned, HuBERT Large performed best (17.36% PER), ahead of WavLM Base (21.59%). Training data volume significantly impacts results: for Swedish, increasing data from ~10 minutes to ~3 hours improved HuBERT Large's frozen-feature test PER from 39.38% to 32.68%. Performance is also analyzed relative to each language's linguistic proximity to English.
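
For reference, PER (phoneme error rate) is the phoneme-level analogue of word error rate: the Levenshtein edit distance between the predicted and reference phoneme sequences divided by the reference length. The self-contained sketch below shows how such a figure is computed; it is not the repository's evaluation code.

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]


def per(reference, hypothesis):
    # Phoneme error rate: edits needed to turn the hypothesis into the reference,
    # normalized by the reference length.
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: 1 substitution + 1 deletion over 5 reference phones -> PER = 0.4
print(per(["g", "u", "d", "m", "o"], ["g", "o", "d", "m"]))
```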

Maintenance & Community

Authored by Clément Apavou, Younes Belkada, Leo Tronchon, and Arthur Zucker. No community channels or sponsorship information is provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, requiring further investigation for adoption.

Limitations & Caveats

Reliance on English-pretrained models may limit performance on linguistically distant languages. Phoneme dictionary availability for target languages is a prerequisite. The project appears research-oriented rather than production-ready.

Health Check

Last Commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

4 stars in the last 30 days
