SpeechT5 by microsoft

Unified-modal pre-training for spoken language processing

Created 3 years ago
1,401 stars

Top 28.9% on SourcePulse

View on GitHub
Project Summary

SpeechT5 is a unified-modal speech-text pre-training framework that learns representations useful across diverse speech and text tasks, including automatic speech recognition (ASR), text-to-speech (TTS), and speech translation. It targets researchers and developers who want a single pre-trained model covering both speech and text, rather than separate systems per modality.

How It Works

SpeechT5 adapts the T5 encoder-decoder architecture to speech and text: a shared encoder-decoder network is wrapped with modality-specific pre-nets, which map text tokens or speech features into a common hidden space, and post-nets, which map decoder states back out to each modality. This design enables sequence-to-sequence transformations in any direction across the two modalities. A key innovation is cross-modal vector quantization, which aligns speech and text in a shared semantic space by mixing encoder and decoder hidden states with latent quantized units during pre-training. The model is pre-trained on large-scale unlabeled speech and text data.
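To make the pre-net/shared-backbone/post-net dataflow concrete, here is a minimal structural sketch in PyTorch. All module names and sizes are hypothetical illustrations of the layout described above, not the repository's actual fairseq-based implementation; the vector quantizer and training losses are omitted:

```python
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    """Hypothetical sketch: one shared encoder-decoder, with modality-specific
    pre-nets mapping speech/text into a common hidden space and post-nets
    mapping decoder states back out to each modality."""

    def __init__(self, vocab_size=10000, n_mels=80, d_model=768, nhead=12, layers=6):
        super().__init__()
        self.text_prenet = nn.Embedding(vocab_size, d_model)   # tokens -> hidden
        self.speech_prenet = nn.Linear(n_mels, d_model)        # mel frames -> hidden
        self.backbone = nn.Transformer(                        # shared encoder-decoder
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.text_postnet = nn.Linear(d_model, vocab_size)     # hidden -> token logits
        self.speech_postnet = nn.Linear(d_model, n_mels)       # hidden -> mel frames

    def forward(self, src, tgt, src_modality="speech", tgt_modality="text"):
        src_h = self.speech_prenet(src) if src_modality == "speech" else self.text_prenet(src)
        tgt_h = self.speech_prenet(tgt) if tgt_modality == "speech" else self.text_prenet(tgt)
        dec = self.backbone(src_h, tgt_h)
        return self.text_postnet(dec) if tgt_modality == "text" else self.speech_postnet(dec)

# e.g. an ASR-shaped pass: speech in, text out
model = SpeechT5Sketch()
mel = torch.randn(1, 200, 80)                    # (batch, frames, mel bins)
tokens = torch.zeros(1, 20, dtype=torch.long)    # decoder input tokens
logits = model(mel, tokens, "speech", "text")    # (1, 20, vocab_size)
```

Swapping the two modality flags gives the other task shapes (TTS, voice conversion, text-to-text) over the same shared backbone, which is the core of the unified-modal design.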

Quick Start & Requirements

Models are readily available on HuggingFace, simplifying integration. The README provides no specific installation commands, but standard HuggingFace Transformers usage is implied; a hedged example follows. Pre-training used datasets such as LibriSpeech and Libri-Light.
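Since the README gives no install steps, the following TTS snippet is a sketch based on the HuggingFace Transformers integration of SpeechT5 (available in recent transformers releases), using the checkpoints published under the microsoft organization on the Hub:

```python
# pip install transformers soundfile torch   (assumed setup; not from the README)
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from SpeechT5.", return_tensors="pt")

# SpeechT5 conditions generation on a 512-dim speaker x-vector. A zero vector
# works as a placeholder; real x-vectors (e.g., CMU ARCTIC speaker embeddings
# on the Hub) produce much more natural voices.
speaker_embeddings = torch.zeros((1, 512))

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)  # model outputs 16 kHz audio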

Highlighted Details

  • Demonstrates strong performance across multiple spoken language tasks:
    • ASR: 2.1 WER with a Transformer LM (LibriSpeech); see the usage sketch after this list.
    • TTS: 3.65 MOS, +0.290 CMOS (LibriTTS).
    • Speech Translation: 25.18 BLEU (EN-DE), 35.30 BLEU (EN-FR) (MuST-C v1).
    • Voice Conversion: Achieves competitive WER and MCD.
    • Speech Enhancement: 8.9 WER (WHAM!).
    • Speaker Identification: 96.49% accuracy (VoxCeleb1).
  • Supports a family of related models (Speech2C, SpeechLM, SpeechUT) for various speech processing challenges.
  • Models are released via HuggingFace and Google Drive.
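As a companion to the TTS snippet above, here is a hedged ASR sketch using the same Transformers integration. The audio file path is a placeholder, and the input is expected to be 16 kHz mono:

```python
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

waveform, sample_rate = sf.read("sample.wav")  # placeholder path; 16 kHz mono expected
inputs = processor(audio=waveform, sampling_rate=sample_rate, return_tensors="pt")

predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```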

Maintenance & Community

The project shows active development with recent paper releases (2023-2024). For technical issues, users should submit GitHub issues. General inquiries can be directed to Long Zhou (lozou@microsoft.com).

Licensing & Compatibility

Licensed under the terms in the repository's LICENSE file; the README does not state the specific license type. Portions of the code are based on FAIRSEQ and ESPnet, so downstream use may also need to account for those projects' licenses.

Limitations & Caveats

The README does not explicitly list limitations, alpha/beta status, or known bugs. Given the project's breadth and research-driven development, expect interfaces and checkpoints to change as new papers and models are released.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 6 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% · 3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago · Updated 1 year ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% · 6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago · Updated 6 months ago