SpeechT5 by microsoft

Unified-modal pre-training for spoken language processing

Created 3 years ago
1,401 stars

Top 28.9% on SourcePulse

View on GitHub
Project Summary

SpeechT5 is a unified-modal speech-text pre-training framework that learns representations useful across diverse speech and text tasks, including automatic speech recognition (ASR), text-to-speech (TTS), and speech translation. It targets researchers and developers who want a single pre-trained model covering both speech and text, rather than separate systems per modality.

How It Works

SpeechT5 adapts the T5 encoder-decoder architecture to speech and text: a shared encoder-decoder network is wrapped with modality-specific pre-nets, which map text tokens or speech features into a common hidden space, and post-nets, which map decoder states back out to each modality. This design enables sequence-to-sequence transformations in any direction across the two modalities. A key innovation is cross-modal vector quantization, which aligns speech and text in a shared semantic space by mixing encoder and decoder hidden states with latent quantized units during pre-training. The model is pre-trained on large-scale unlabeled speech and text data.
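To make the pre-net/shared-backbone/post-net dataflow concrete, here is a minimal structural sketch in PyTorch. All module names and sizes are hypothetical illustrations of the layout described above, not the repository's actual fairseq-based implementation; the vector quantizer and training losses are omitted:

```python
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    """Hypothetical sketch: one shared encoder-decoder, with modality-specific
    pre-nets mapping speech/text into a common hidden space and post-nets
    mapping decoder states back out to each modality."""

    def __init__(self, vocab_size=10000, n_mels=80, d_model=768, nhead=12, layers=6):
        super().__init__()
        self.text_prenet = nn.Embedding(vocab_size, d_model)   # tokens -> hidden
        self.speech_prenet = nn.Linear(n_mels, d_model)        # mel frames -> hidden
        self.backbone = nn.Transformer(                        # shared encoder-decoder
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.text_postnet = nn.Linear(d_model, vocab_size)     # hidden -> token logits
        self.speech_postnet = nn.Linear(d_model, n_mels)       # hidden -> mel frames

    def forward(self, src, tgt, src_modality="speech", tgt_modality="text"):
        src_h = self.speech_prenet(src) if src_modality == "speech" else self.text_prenet(src)
        tgt_h = self.speech_prenet(tgt) if tgt_modality == "speech" else self.text_prenet(tgt)
        dec = self.backbone(src_h, tgt_h)
        return self.text_postnet(dec) if tgt_modality == "text" else self.speech_postnet(dec)

# e.g. an ASR-shaped pass: speech in, text out
model = SpeechT5Sketch()
mel = torch.randn(1, 200, 80)                    # (batch, frames, mel bins)
tokens = torch.zeros(1, 20, dtype=torch.long)    # decoder input tokens
logits = model(mel, tokens, "speech", "text")    # (1, 20, vocab_size)
```

Swapping the two modality flags gives the other task shapes (TTS, voice conversion, text-to-text) over the same shared backbone, which is the core of the unified-modal design.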

Quick Start & Requirements

Models are readily available on HuggingFace, simplifying integration. The README provides no specific installation commands, but standard HuggingFace Transformers usage is implied; a hedged example follows. Pre-training used datasets such as LibriSpeech and Libri-Light.
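Since the README gives no install steps, the following TTS snippet is a sketch based on the HuggingFace Transformers integration of SpeechT5 (available in recent transformers releases), using the checkpoints published under the microsoft organization on the Hub:

```python
# pip install transformers soundfile torch   (assumed setup; not from the README)
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from SpeechT5.", return_tensors="pt")

# SpeechT5 conditions generation on a 512-dim speaker x-vector. A zero vector
# works as a placeholder; real x-vectors (e.g., CMU ARCTIC speaker embeddings
# on the Hub) produce much more natural voices.
speaker_embeddings = torch.zeros((1, 512))

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)  # model outputs 16 kHz audio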

Highlighted Details

  • Demonstrates strong performance across multiple spoken language tasks:
    • ASR: 2.1 WER with a Transformer LM (LibriSpeech); see the usage sketch after this list.
    • TTS: 3.65 MOS, +0.290 CMOS (LibriTTS).
    • Speech Translation: 25.18 BLEU (EN-DE), 35.30 BLEU (EN-FR) (MuST-C v1).
    • Voice Conversion: Achieves competitive WER and MCD.
    • Speech Enhancement: 8.9 WER (WHAM!).
    • Speaker Identification: 96.49% accuracy (VoxCeleb1).
  • Supports a family of related models (Speech2C, SpeechLM, SpeechUT) for various speech processing challenges.
  • Models are released via HuggingFace and Google Drive.
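As a companion to the TTS snippet above, here is a hedged ASR sketch using the same Transformers integration. The audio file path is a placeholder, and the input is expected to be 16 kHz mono:

```python
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

waveform, sample_rate = sf.read("sample.wav")  # placeholder path; 16 kHz mono expected
inputs = processor(audio=waveform, sampling_rate=sample_rate, return_tensors="pt")

predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```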

Maintenance & Community

The project shows active development with recent paper releases (2023-2024). For technical issues, users should submit GitHub issues. General inquiries can be directed to Long Zhou (lozou@microsoft.com).

Licensing & Compatibility

Licensed under the terms in the repository's LICENSE file; the README does not state the specific license type. Portions of the code are based on FAIRSEQ and ESPnet, so downstream use may also need to account for those projects' licenses.

Limitations & Caveats

The README does not explicitly list limitations, alpha/beta status, or known bugs. Given the project's breadth and research-driven development, expect interfaces and checkpoints to change as new papers and models are released.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 6 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% · 3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago · Updated 1 year ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% · 6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago · Updated 6 months ago