This repository provides pre-trained wav2vec 2.0 and HuBERT models for Chinese speech. It targets researchers and developers working on Chinese Automatic Speech Recognition (ASR) and related speech processing tasks, and reports a significantly lower character error rate (CER) than traditional FBank features.
How It Works
The project leverages the Fairseq toolkit to train wav2vec 2.0 and HuBERT models on 10,000 hours of diverse Chinese speech data from WenetSpeech. This self-supervised approach learns robust speech representations from unlabeled audio, which are then used as feature extractors for downstream ASR tasks. The pre-trained models can be integrated into ASR architectures like Conformer by summing hidden layer representations, replacing conventional acoustic features.
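The layer-combination step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `combine_layers`, the `(num_layers, frames, dim)` input layout, and the optional softmax-weighted variant are all assumptions made for the example. In a real ASR model the layer weights would typically be learned jointly with the downstream network.

```python
import numpy as np


def combine_layers(hidden_states, weights=None):
    """Merge per-layer features (num_layers, frames, dim) into (frames, dim).

    Hypothetical helper: with weights=None it performs a plain sum over
    layers; otherwise it applies a softmax-normalised weighted sum.
    """
    stacked = np.asarray(hidden_states, dtype=np.float32)
    if stacked.ndim != 3:
        raise ValueError("expected an array shaped (num_layers, frames, dim)")

    if weights is None:
        # Plain sum over the layer axis.
        return stacked.sum(axis=0)

    weights = np.asarray(weights, dtype=np.float32)
    if weights.shape != (stacked.shape[0],):
        raise ValueError("need exactly one weight per layer")

    # Softmax keeps the layer weights positive and summing to 1.
    w = np.exp(weights - weights.max())
    w /= w.sum()
    # Contract the layer axis: (L,) x (L, T, D) -> (T, D).
    return np.tensordot(w, stacked, axes=1)
```

The resulting `(frames, dim)` matrix takes the place of the FBank feature matrix as input to the downstream encoder.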
Quick Start & Requirements
- Installation: Requires the fairseq and transformers Python packages.
- Dependencies: PyTorch, soundfile. GPU with CUDA is recommended for inference.
- Models: Available for download via Baidu Pan (with extraction codes) and Hugging Face.
- Usage: Python code examples are provided for both Fairseq and Hugging Face model loading and feature extraction.
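A minimal sketch of the Hugging Face loading and feature-extraction path might look like the following. Several details are assumptions rather than facts taken from this summary: the model identifier `TencentGameMate/chinese-hubert-base`, the 16 kHz mono input requirement, and the helper name `extract_hidden_states`. The heavy dependencies are imported inside the function so that defining it does not require them to be installed.

```python
def extract_hidden_states(wav_path, model_id="TencentGameMate/chinese-hubert-base"):
    """Return the tuple of per-layer hidden states for one audio file.

    Hypothetical helper; model_id is an assumed Hugging Face identifier.
    """
    # Imported lazily because these packages are large and optional.
    import soundfile as sf
    import torch
    from transformers import HubertModel, Wav2Vec2FeatureExtractor

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = HubertModel.from_pretrained(model_id)
    model.eval()

    # Assumption: the file is mono and sampled at 16 kHz.
    wav, sample_rate = sf.read(wav_path, dtype="float32")
    if sample_rate != 16000:
        raise ValueError(f"expected 16 kHz audio, got {sample_rate} Hz")

    inputs = feature_extractor(wav, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values, output_hidden_states=True)

    # One tensor per layer, each shaped (batch, frames, hidden_dim).
    return outputs.hidden_states
```

For a wav2vec 2.0 checkpoint, `Wav2Vec2Model` would replace `HubertModel`. Calling `extract_hidden_states("sample.wav")` with a placeholder path downloads the checkpoint on first use.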
Highlighted Details
- Trained on 10,000 hours of Chinese data from WenetSpeech, covering 10 major recording scenarios.
- Offers both BASE and LARGE model sizes for wav2vec 2.0 and HuBERT.
- Demonstrates significant CER reduction on Aishell and WenetSpeech datasets when used with Conformer ASR models.
- The LARGE HuBERT model achieves a CER of 3.3% on the Aishell test set using 178 hours of training data.
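For readers unfamiliar with the metric, CER is the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below is a generic illustration; the function names are hypothetical, and the project's own scoring scripts may normalise text differently (for example in how they treat punctuation).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    ref = [ch for ch in reference if not ch.isspace()]
    hyp = [ch for ch in hypothesis if not ch.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("今天天气很好", "今天天气真好")` counts one substitution against six reference characters.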
Maintenance & Community
- Developed by TencentGameMate.
- Project is cited by GPT-SoVITS.
- References extensive academic work in speech processing.
Licensing & Compatibility
- The repository itself does not explicitly state a license.
- Models are provided via Hugging Face, subject to their terms.
- The underlying frameworks have permissive licenses: Fairseq is MIT-licensed and Transformers is Apache 2.0-licensed.
Limitations & Caveats
- Models are pre-trained on audio only and require fine-tuning with labeled text data for ASR tasks.
- Baidu Pan download links may have regional restrictions or require specific clients.
- The project does not include a specific tokenizer, necessitating its creation for fine-tuning.
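Since no tokenizer is shipped, one common starting point for Chinese ASR is a character-level vocabulary built from the training transcripts. The sketch below is illustrative only: the function names are hypothetical, and the "symbol count" line layout is an assumption about what a Fairseq-style dictionary file expects, so check it against the fine-tuning recipe you use.

```python
from collections import Counter


def build_char_vocab(transcripts):
    """Count every non-whitespace character across the transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(ch for ch in text if not ch.isspace())
    # Most frequent first; ties broken by code point for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_dict_lines(vocab):
    """Render the vocabulary as '<symbol> <count>' lines (assumed layout)."""
    return [f"{ch} {count}" for ch, count in vocab]
```

Writing `"\n".join(to_dict_lines(build_char_vocab(transcripts)))` to a file yields a candidate dictionary. Special symbols such as a CTC blank or an unknown token usually still need to be added according to the toolkit's conventions.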