TeleSpeech-ASR by Tele-AI

Speech model for diverse dialects

Created 1 year ago
764 stars

Top 45.6% on SourcePulse

Project Summary

This repository provides the TeleSpeech-ASR large model, an automatic speech recognition system capable of recognizing over 30 Chinese dialects. It is designed for researchers and developers working with diverse Chinese dialects, offering pre-trained models and fine-tuning frameworks to achieve high accuracy with limited labeled data.

How It Works

The project leverages a self-supervised pre-training approach on 300,000 hours of unlabeled multi-dialectal speech data. This is followed by fine-tuning on 30 types of labeled dialect data. The core advantage lies in its ability to break the limitation of single-dialect models, enabling a unified model to comprehend a wide range of dialects. Users can either fine-tune the pre-trained models using frameworks like Fairseq or use them as feature extractors with Wenet for downstream ASR tasks.
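For the feature-extractor route, here is a minimal sketch that loads a Fairseq checkpoint and runs a dummy MFCC batch through it. The checkpoint filename, the (batch, frames, 40) input shape, and the HuBERT-style extract_features call are assumptions that may not match the actual TeleSpeech model interface; the repository's fine-tuning and Wenet integration scripts are the authoritative reference.

```python
# Hedged sketch: use a TeleSpeech pre-trained checkpoint as a feature extractor.
# The checkpoint path, the MFCC input shape, and the extract_features() signature
# are assumptions modeled on typical Fairseq self-supervised speech models.
import torch
from fairseq import checkpoint_utils

CKPT = "pretrain_base.pt"  # hypothetical local path to a downloaded checkpoint

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([CKPT])
model = models[0].eval()

# Dummy batch: 1 utterance, 200 frames of 40-dimensional MFCC features.
mfcc = torch.randn(1, 200, 40)

with torch.no_grad():
    # Argument names follow the HuBERT convention; the real model may differ.
    features = model.extract_features(source=mfcc, padding_mask=None)

print(type(features))  # inspect the output before wiring it into a Wenet frontend
```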

Quick Start & Requirements

  • Installation: Clone Fairseq and install it in editable mode (pip install --editable ./ inside the Fairseq directory), then install the remaining dependencies (pip install -r requirements.txt, or individual packages such as kaldiio, timm, editdistance, and soundfile).
  • Prerequisites: Python >= 3.8 and PyTorch >= 1.13.0. Kaldi is required for feature extraction unless kaldi_io.py is used instead.
  • Data Preparation: Extract 40-dimensional MFCC features with Kaldi scripts and build the .tsv data lists required for training and inference (see the sketch after this list).
  • Links: Fairseq: https://github.com/pytorch/fairseq, Kaldi: https://github.com/kaldi-asr/kaldi
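As referenced in the Data Preparation item, the sketch below reads Kaldi-extracted MFCC features with kaldiio and writes a simple utterance list. The feats.scp path and the two-column .tsv layout are illustrative assumptions; the repository's example data lists define the actual format expected by the training scripts.

```python
# Hedged sketch: inspect Kaldi 40-dim MFCC features and write a simple utterance list.
# The feats.scp path and the two-column .tsv layout (utterance id, frame count) are
# assumptions for illustration only; check the repository's example data lists for
# the real manifest format.
import kaldiio

feats = kaldiio.load_scp("data/train/feats.scp")  # lazy dict: utt_id -> feature matrix

with open("train.tsv", "w", encoding="utf-8") as out:
    for utt_id, mat in feats.items():
        assert mat.shape[1] == 40, f"expected 40-dim MFCC, got {mat.shape[1]}"
        out.write(f"{utt_id}\t{mat.shape[0]}\n")  # frames per utterance
```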

Highlighted Details

  • Offers three open-sourced models: two pre-trained models (0.09B and 0.3B parameters) and one fine-tuned model on the KeSpeech dataset (0.3B parameters).
  • Achieves competitive character error rates (CER) across benchmarks with the pretrain_large model: Aishell-1 (4.0%), WenetSpeech (13.0%), Babel (19.1%), and KeSpeech (8.1%); a minimal CER computation is sketched after this list.
  • Provides detailed instructions for fine-tuning pre-trained models and for using them as feature extractors for downstream ASR tasks via Wenet.
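For context on the CER figures above, the snippet below shows a standard character-level CER computation using the editdistance package from the requirements; it is a generic illustration, not the repository's own scoring script.

```python
# Hedged sketch: character error rate (CER) as commonly computed for Chinese ASR.
# Generic illustration using the editdistance dependency; not the project's scorer.
import editdistance

def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    ref_chars = list(reference)
    hyp_chars = list(hypothesis)
    return editdistance.eval(ref_chars, hyp_chars) / max(len(ref_chars), 1)

print(cer("今天天气很好", "今天天气真好"))  # one substitution over six characters ≈ 0.167
```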

Maintenance & Community

  • The project is maintained by Tele-AI. No community channels (e.g., Discord, Slack) are mentioned in the README.

Licensing & Compatibility

  • The model is released under the "TeleSpeech Model Community License Agreement."
  • Commercial use is permitted upon application and approval via email (tele_ai@chinatelecom.cn), granting a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable commercial license.

Limitations & Caveats

The project's usage statement strongly advises against applying the TeleSpeech models to any activity that endangers national or social security or is otherwise illegal, and requires a security assessment and regulatory filing before the models are used in internet services. The authors disclaim responsibility for issues arising from data security, public-opinion risks, or misuse of the models, despite their efforts to ensure data compliance. Note that the unsupervised pre-trained models (pretrain_base, pretrain_large) cannot be used for inference directly; they must first undergo supervised fine-tuning on labeled data.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days
