chinese_speech_pretrain by TencentGameMate

Speech models for Chinese ASR tasks

Created 3 years ago

1,185 stars

Top 32.7% on SourcePulse

Project Summary

This repository provides pre-trained wav2vec 2.0 and HuBERT models for Chinese speech. It targets researchers and developers working on Chinese automatic speech recognition (ASR) and related speech processing tasks. Used as feature extractors, the models lower the character error rate (CER) relative to conventional FBank features.

How It Works

The project uses the Fairseq toolkit to pre-train wav2vec 2.0 and HuBERT models on 10,000 hours of Chinese speech from the WenetSpeech corpus. Training is self-supervised: the models learn speech representations from unlabeled audio and then serve as feature extractors for downstream ASR. In an ASR architecture such as Conformer, the hidden-layer representations of the pre-trained model are summed and fed to the network in place of conventional acoustic features.
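The summing of hidden-layer representations can be illustrated with a small sketch. The source only says the layers are summed; the softmax-normalized per-layer scores below are an assumption (a common way to fuse layers of a self-supervised encoder), and the names `weighted_layer_sum` and `toy_layers` are hypothetical. Plain Python lists stand in for tensors so the arithmetic is easy to follow.

```python
import math
from typing import List, Sequence

Frame = List[float]   # one feature vector
Layer = List[Frame]   # shape: [num_frames][feature_dim]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw layer scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers: Sequence[Layer], scores: Sequence[float]) -> Layer:
    """Fuse per-layer features into one sequence with the same shape as a layer."""
    if len(layers) != len(scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(scores)
    num_frames, dim = len(layers[0]), len(layers[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layers):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return fused


# Toy input: 3 "layers", 2 frames, 2 feature dimensions.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
# Equal scores give equal weights, so the result is the plain layer average.
uniform = weighted_layer_sum(toy_layers, [0.0, 0.0, 0.0])
```

In a real system the scores would be trainable parameters updated together with the ASR model, and the fused sequence would take the place of the FBank input.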

Quick Start & Requirements

Installation: Requires fairseq and transformers Python packages.
Dependencies: PyTorch, soundfile. GPU with CUDA is recommended for inference.
Models: Available for download via Baidu Pan (with extraction codes) and Hugging Face.
Usage: Python code examples are provided for both Fairseq and Hugging Face model loading and feature extraction.
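A minimal sketch of the Hugging Face loading path might look like the following. It is not the repository's own example: the function names are hypothetical, the 16 kHz mono input is an assumption, and the convolutional layer sizes are the standard wav2vec 2.0 / HuBERT defaults, which should be checked against the downloaded model's config. The heavy dependencies are imported inside the function so the frame-count helper runs on its own.

```python
from typing import Sequence, Tuple

# Assumption: standard wav2vec 2.0 / HuBERT convolutional front end,
# given as (kernel, stride) per layer.
CONV_LAYERS: Sequence[Tuple[int, int]] = (
    (10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2),
)


def expected_num_frames(num_samples: int) -> int:
    """Number of feature frames produced for `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return max(length, 0)


def extract_hidden_states(model_id: str, wav_path: str):
    """Load a HuBERT checkpoint and return the hidden states of every layer."""
    # Imported lazily so the helper above works without these packages.
    import soundfile as sf
    import torch
    from transformers import HubertModel, Wav2Vec2FeatureExtractor

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = HubertModel.from_pretrained(model_id).eval()

    wav, sample_rate = sf.read(wav_path)  # assumes a mono recording
    if sample_rate != 16000:
        raise ValueError("resample the audio to 16 kHz first")
    inputs = feature_extractor(wav, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values, output_hidden_states=True)
    # Tuple of tensors, each shaped (batch, frames, hidden_size).
    return outputs.hidden_states
```

`model_id` would be a local directory or a Hugging Face repository ID for one of the released checkpoints, for example `TencentGameMate/chinese-hubert-base` (assumed here; confirm the exact name on the model page). Under the assumed front end, one second of 16 kHz audio yields 49 frames, roughly one frame every 20 ms.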

Highlighted Details

Trained on 10,000 hours of Chinese data from WenetSpeech, covering 10 major recording scenarios.
Offers both BASE and LARGE model sizes for wav2vec 2.0 and HuBERT.
Demonstrates significant CER reduction on Aishell and WenetSpeech datasets when used with Conformer ASR models.
The LARGE HuBERT model achieves a CER of 3.3% on the Aishell test set using 178h of training data.
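For readers unfamiliar with the metric, CER is the character-level edit distance between the recognized text and the reference, divided by the number of reference characters. The sketch below is a generic implementation for illustration, not the scoring script used to produce the figures above; the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be paired")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_errors / total_chars


# One wrong character out of six.
example = cer(["今天天气很好"], ["今天天汽很好"])
```

A CER of 3.3% therefore corresponds to about 33 character errors per 1,000 reference characters.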

Maintenance & Community

Developed by TencentGameMate.
Project is cited by GPT-SoVITS.
References extensive academic work in speech processing.

Licensing & Compatibility

The repository itself does not explicitly state a license.
Models are provided via Hugging Face, subject to their terms.
The underlying frameworks have permissive licenses: Fairseq is MIT-licensed and Hugging Face Transformers is Apache-2.0-licensed.

Limitations & Caveats

Models are pre-trained on audio only and require fine-tuning with labeled text data for ASR tasks.
Baidu Pan download links may have regional restrictions or require specific clients.
The project does not ship a tokenizer or vocabulary, so one must be created before fine-tuning.
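One way to fill the tokenizer gap for CTC-style fine-tuning is a character-level vocabulary built from the training transcripts. The sketch below is an assumption about how that could be done, not something the repository provides; the special tokens and function names are hypothetical choices.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical special tokens; index 0 is intended to double as the CTC blank.
SPECIAL_TOKENS = ("<pad>", "<unk>")


def build_char_vocab(transcripts: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Map every sufficiently frequent non-space character to an integer ID."""
    counts = Counter(ch for line in transcripts for ch in line if not ch.isspace())
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for ch in sorted(counts):  # sorted for a reproducible ID assignment
        if counts[ch] >= min_count:
            vocab[ch] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a transcript to label IDs, mapping unseen characters to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text if not ch.isspace()]


toy_vocab = build_char_vocab(["今天 天气", "天气好"])
vocab_json = json.dumps(toy_vocab, ensure_ascii=False)  # could be saved as vocab.json
```

A JSON file in this shape is the kind of vocabulary that a CTC tokenizer such as `Wav2Vec2CTCTokenizer` in Transformers reads, provided its pad and unknown tokens are configured to match.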

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

4 stars in the last 30 days

Explore Similar Projects

AudioBench by AudioLLMs

A universal benchmark for evaluating audio large language models

Created 1 year ago

Updated 6 months ago

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 4 months ago

FunCodec by modelscope

Speech codec toolkit for audio quantization and downstream tasks

Created 2 years ago

Updated 1 year ago

deepspeech-german by AASHISHAG

ASR module using Mozilla DeepSpeech for German speech

Created 6 years ago

Updated 2 years ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

espnet_model_zoo by espnet

SDK for managing pretrained ESPnet models, including Hugging Face models

Created 5 years ago

Updated 2 years ago

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

huggingsound by jonatasgrosman

Speech toolkit for speech-related tasks based on Hugging Face's tools

Created 3 years ago

Updated 2 years ago

Starred by Taranjeet Singh (Cofounder of Mem0) and Omar Sanseviero (DevRel at Google DeepMind).

vakyansh-models by Open-Speech-EkStep

Open-source speech models for Indic languages

Created 4 years ago

Updated 3 years ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

GigaSpeech by SpeechColab

Large dataset for speech recognition research

Created 4 years ago

Updated 1 year ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

UniSpeech by microsoft

Speech models for self-supervised learning

Created 4 years ago

Updated 1 year ago

parrots by shibing624

ASR/TTS toolkit for multilingual speech processing

Created 7 years ago

Updated 2 months ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

icefall by k2-fsa

Speech-related recipes for various datasets using k2-fsa and lhotse

Created 4 years ago

Updated 1 month ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Piotr Dąbkowski (Cofounder of ElevenLabs), and 2 more.

PaddleSpeech by PaddlePaddle

Speech toolkit for ASR, TTS, speaker verification, translation, and keyword spotting

Created 8 years ago

Updated 2 months ago
