Multilingual-PR by ASR-project

Multilingual phoneme recognition via self-supervised speech models

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This repository, ASR-project/Multilingual-PR, addresses phoneme recognition across diverse languages by leveraging self-supervised models pretrained on English speech (Wav2vec2, HuBERT, WavLM) together with CTC. It targets speech researchers and engineers facing data scarcity for non-English languages, offering a framework to evaluate how well knowledge transfers from English-pretrained audio models and thereby supporting multilingual ASR development.

How It Works

The project compares Wav2vec2, HuBERT, and WavLM models pre-trained on English audio. Each is used for phoneme recognition either by fine-tuning on target-language data or by extracting frozen features for a linear classifier, in both cases trained with a Connectionist Temporal Classification (CTC) loss. This methodology systematically investigates how well English-centric acoustic features generalize to phonetically different languages and assesses the trade-offs between fine-tuning and feature extraction.
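
As a rough illustration of the frozen-feature path, the sketch below wires a pretrained WavLM encoder (weights frozen) into a linear phoneme classifier trained with a CTC loss, using HuggingFace Transformers and PyTorch. The checkpoint name, vocabulary size, and batching details are illustrative assumptions, not the repository's actual training code.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel


class FrozenWavLMPhonemeCTC(nn.Module):
    """Frozen pretrained encoder + linear phoneme head trained with CTC (sketch)."""

    def __init__(self, num_phonemes: int, checkpoint: str = "microsoft/wavlm-large"):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(checkpoint)
        for p in self.encoder.parameters():  # freeze the pretrained encoder
            p.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_phonemes + 1)  # +1 for the CTC blank
        self.ctc_loss = nn.CTCLoss(blank=num_phonemes, zero_infinity=True)

    def forward(self, waveforms, targets, target_lengths):
        # waveforms: (batch, samples) of 16 kHz audio; targets: padded phoneme ids
        with torch.no_grad():
            frames = self.encoder(waveforms).last_hidden_state  # (batch, frames, hidden)
        log_probs = self.classifier(frames).log_softmax(dim=-1)
        input_lengths = torch.full((waveforms.size(0),), log_probs.size(1), dtype=torch.long)
        # nn.CTCLoss expects (frames, batch, classes)
        return self.ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

Fine-tuning corresponds to leaving the encoder parameters trainable instead of freezing them.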

Quick Start & Requirements

An example notebook guides training and testing. Dependencies include HuggingFace Transformers, PyTorch-Lightning, and Weights & Biases. The project uses the Mozilla CommonVoice dataset and requires the phonemizer library to convert text transcripts into phoneme ground truth. A GPU is assumed for training. Command-line arguments manage hyperparameters, dataset selection, and model configuration.
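
For the phonemizer conversion step, a minimal sketch is shown below; the Swedish language code, separator settings, and example sentence are illustrative assumptions (the espeak backend also requires espeak-ng to be installed on the system), and the repository's exact preprocessing may differ.

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

transcript = "god morgon"  # example CommonVoice-style transcript (Swedish)
phonemes = phonemize(
    transcript,
    language="sv",          # espeak language code for the target language
    backend="espeak",
    separator=Separator(phone=" ", word=" | "),  # space-separated phones, "|" between words
    strip=True,
)
tokens = phonemes.split()   # phone symbols (plus "|" word boundaries) used as CTC targets
print(tokens)
```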

Highlighted Details

With frozen features, WavLM Large achieved the best average test PER (28.31%), outperforming Wav2vec2 Base (44.41%). When fine-tuned, HuBERT Large performed best (17.36% PER), ahead of WavLM Base (21.59%). Training data volume significantly impacts results: for Swedish, increasing data from ~10 minutes to ~3 hours improved HuBERT Large's frozen-feature test PER from 39.38% to 32.68%. Performance is also analyzed relative to each language's linguistic proximity to English.
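
For reference, PER (phoneme error rate) is the phoneme-level analogue of word error rate: the Levenshtein edit distance between the predicted and reference phoneme sequences divided by the reference length. The self-contained sketch below shows how such a figure is computed; it is not the repository's evaluation code.

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]


def per(reference, hypothesis):
    # Phoneme error rate: edits needed to turn the hypothesis into the reference,
    # normalized by the reference length.
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: 1 substitution + 1 deletion over 5 reference phones -> PER = 0.4
print(per(["g", "u", "d", "m", "o"], ["g", "o", "d", "m"]))
```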

Maintenance & Community

Authored by Clément Apavou, Younes Belkada, Leo Tronchon, and Arthur Zucker. No community channels or sponsorship information is provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, requiring further investigation for adoption.

Limitations & Caveats

Reliance on English-pretrained models may limit performance on linguistically distant languages. Phoneme dictionary availability for target languages is a prerequisite. The project appears research-oriented rather than production-ready.

Health Check

Last Commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

4 stars in the last 30 days
