TeleSpeech-ASR by Tele-AI

Speech model for diverse dialects

Created 1 year ago
764 stars

Top 45.6% on SourcePulse

Project Summary

This repository provides the TeleSpeech-ASR large model, an automatic speech recognition system capable of recognizing over 30 Chinese dialects. It is designed for researchers and developers working with diverse Chinese dialects, offering pre-trained models and fine-tuning frameworks to achieve high accuracy with limited labeled data.

How It Works

The project leverages a self-supervised pre-training approach on 300,000 hours of unlabeled multi-dialectal speech data. This is followed by fine-tuning on 30 types of labeled dialect data. The core advantage lies in its ability to break the limitation of single-dialect models, enabling a unified model to comprehend a wide range of dialects. Users can either fine-tune the pre-trained models using frameworks like Fairseq or use them as feature extractors with Wenet for downstream ASR tasks.
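For the feature-extractor route, here is a minimal sketch that loads a Fairseq checkpoint and runs a dummy MFCC batch through it. The checkpoint filename, the (batch, frames, 40) input shape, and the HuBERT-style extract_features call are assumptions that may not match the actual TeleSpeech model interface; the repository's fine-tuning and Wenet integration scripts are the authoritative reference.

```python
# Hedged sketch: use a TeleSpeech pre-trained checkpoint as a feature extractor.
# The checkpoint path, the MFCC input shape, and the extract_features() signature
# are assumptions modeled on typical Fairseq self-supervised speech models.
import torch
from fairseq import checkpoint_utils

CKPT = "pretrain_base.pt"  # hypothetical local path to a downloaded checkpoint

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([CKPT])
model = models[0].eval()

# Dummy batch: 1 utterance, 200 frames of 40-dimensional MFCC features.
mfcc = torch.randn(1, 200, 40)

with torch.no_grad():
    # Argument names follow the HuBERT convention; the real model may differ.
    features = model.extract_features(source=mfcc, padding_mask=None)

print(type(features))  # inspect the output before wiring it into a Wenet frontend
```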

Quick Start & Requirements

  • Installation: Clone Fairseq and install it in editable mode (pip install --editable ./ inside the Fairseq directory), then install the remaining dependencies (pip install -r requirements.txt, or individual packages such as kaldiio, timm, editdistance, and soundfile).
  • Prerequisites: Python >= 3.8 and PyTorch >= 1.13.0. Kaldi is required for feature extraction unless kaldi_io.py is used instead.
  • Data Preparation: Extract 40-dimensional MFCC features with Kaldi scripts and build the .tsv data lists required for training and inference (see the sketch after this list).
  • Links: Fairseq: https://github.com/pytorch/fairseq, Kaldi: https://github.com/kaldi-asr/kaldi
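As referenced in the Data Preparation item, the sketch below reads Kaldi-extracted MFCC features with kaldiio and writes a simple utterance list. The feats.scp path and the two-column .tsv layout are illustrative assumptions; the repository's example data lists define the actual format expected by the training scripts.

```python
# Hedged sketch: inspect Kaldi 40-dim MFCC features and write a simple utterance list.
# The feats.scp path and the two-column .tsv layout (utterance id, frame count) are
# assumptions for illustration only; check the repository's example data lists for
# the real manifest format.
import kaldiio

feats = kaldiio.load_scp("data/train/feats.scp")  # lazy dict: utt_id -> feature matrix

with open("train.tsv", "w", encoding="utf-8") as out:
    for utt_id, mat in feats.items():
        assert mat.shape[1] == 40, f"expected 40-dim MFCC, got {mat.shape[1]}"
        out.write(f"{utt_id}\t{mat.shape[0]}\n")  # frames per utterance
```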

Highlighted Details

  • Offers three open-sourced models: two pre-trained models (0.09B and 0.3B parameters) and one fine-tuned model on the KeSpeech dataset (0.3B parameters).
  • Achieves competitive character error rates (CER) across benchmarks with the pretrain_large model: Aishell-1 (4.0%), WenetSpeech (13.0%), Babel (19.1%), and KeSpeech (8.1%); a minimal CER computation is sketched after this list.
  • Provides detailed instructions for fine-tuning pre-trained models and for using them as feature extractors for downstream ASR tasks via Wenet.
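For context on the CER figures above, the snippet below shows a standard character-level CER computation using the editdistance package from the requirements; it is a generic illustration, not the repository's own scoring script.

```python
# Hedged sketch: character error rate (CER) as commonly computed for Chinese ASR.
# Generic illustration using the editdistance dependency; not the project's scorer.
import editdistance

def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    ref_chars = list(reference)
    hyp_chars = list(hypothesis)
    return editdistance.eval(ref_chars, hyp_chars) / max(len(ref_chars), 1)

print(cer("今天天气很好", "今天天气真好"))  # one substitution over six characters ≈ 0.167
```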

Maintenance & Community

  • The project is maintained by Tele-AI. No community channels (e.g., Discord, Slack) are mentioned in the README.

Licensing & Compatibility

  • The model is released under the "TeleSpeech Model Community License Agreement."
  • Commercial use is permitted upon application and approval via email (tele_ai@chinatelecom.cn), granting a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable commercial license.

Limitations & Caveats

The project's usage statement strongly advises against applying the TeleSpeech models to any activity that endangers national or social security or is otherwise illegal, and requires a security assessment and regulatory filing before the models are used in internet services. The authors disclaim responsibility for issues arising from data security, public-opinion risks, or misuse of the models, despite their efforts to ensure data compliance. Note that the unsupervised pre-trained models (pretrain_base, pretrain_large) cannot be used for inference directly; they must first undergo supervised fine-tuning on labeled data.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days
