UniSpeech by microsoft

Self-supervised learning models for speech processing

created 4 years ago
467 stars

Project Summary

This repository provides UniSpeech, a family of large-scale self-supervised learning models for speech processing, including WavLM, UniSpeech, and UniSpeech-SAT. It offers pre-trained models and evaluation results for tasks like automatic speech recognition, speaker verification, speech separation, and speaker diarization, targeting researchers and developers in the speech technology domain.

How It Works

The UniSpeech family learns speech representations from large speech corpora through self-supervised pre-training. WavLM is trained with a masked prediction objective: spans of acoustic frames are masked, and the model predicts targets for the masked positions. UniSpeech unifies self-supervised and supervised learning in a single pre-training stage, so it can use both unlabeled and labeled data. UniSpeech-SAT adds speaker-aware pre-training to improve performance on speaker-related tasks. The shared goal is a universal speech representation that transfers across a wide range of downstream tasks.
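The masked prediction idea can be illustrated with a minimal, standard-library sketch. It is not the repository's training code: the function names, the span-start probability, and the span length are illustrative assumptions, and the "model" is replaced by fixed uniform predictions. It shows only the two ingredients of the objective: masking contiguous spans of frames, and averaging a cross-entropy loss over the masked frames alone.

```python
import math
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, seed=0):
    """Return a list of booleans, True where a frame is masked.

    start_prob and span_len are illustrative defaults (assumptions),
    not values taken from this repository.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        # Each frame may begin a masked span of span_len frames.
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_prediction_loss(log_probs, targets, mask):
    """Average cross-entropy over masked frames only.

    log_probs[t][k] is the predicted log-probability of target unit k
    at frame t; targets[t] is the index of the correct unit.
    """
    losses = [-log_probs[t][targets[t]] for t in range(len(mask)) if mask[t]]
    return sum(losses) / len(losses) if losses else 0.0


# Toy check: uniform predictions over 4 units give a loss of log(4).
num_units = 4
uniform = [[math.log(1.0 / num_units)] * num_units for _ in range(6)]
targets = [0, 1, 2, 3, 0, 1]
mask = [False, True, True, False, False, True]
loss = masked_prediction_loss(uniform, targets, mask)
print(round(loss, 4))  # prints 1.3863
```

Unmasked frames contribute nothing to the loss, which is what pushes the model to infer masked content from the surrounding context.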

Quick Start & Requirements

  • Models are available on HuggingFace.
  • Source code is located in the src directory.
  • The README does not list specific requirements (GPU, CUDA, Python versions); expect the usual needs of large deep learning models.
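Since the models are published on HuggingFace, one plausible way to try them is through the `transformers` library. The sketch below is an assumption-laden example, not an official quick start: it assumes `torch` and `transformers` are installed, that the checkpoint id `microsoft/wavlm-base-plus` is the one wanted, and that input audio is 16 kHz mono. The helper name `embed_waveform` is hypothetical.

```python
SAMPLE_RATE = 16_000  # assumption: the checkpoint expects 16 kHz mono audio
MODEL_ID = "microsoft/wavlm-base-plus"  # assumption: checkpoint id on the HuggingFace Hub


def embed_waveform(waveform, model_id=MODEL_ID):
    """Return frame-level hidden states for one waveform (a list of floats).

    Hypothetical helper: the heavy imports are done lazily so that merely
    defining this function does not require torch or transformers.
    """
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    # Downloads the checkpoint on first use and caches it locally.
    extractor = AutoFeatureExtractor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Tensor of shape (batch, frames, hidden_size).
    return outputs.last_hidden_state
```

Calling `embed_waveform([0.0] * SAMPLE_RATE)` would run one second of silence through the model. Other checkpoints in the family may need a different model id, and the repository's own src directory may offer task-specific entry points that this sketch does not cover.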

Highlighted Details

  • In the reported results, UniSpeech-SAT Large and WavLM Large reach state-of-the-art performance on speaker verification benchmarks, outperforming models such as HuBERT and wav2vec 2.0.
  • The same two models are also competitive on speech separation and speaker diarization.
  • A wide array of pre-trained models is available, covering different languages and training dataset sizes.

Maintenance & Community

  • Contact for issues: Submit a GitHub issue.
  • Contact for other communications: Yu Wu (yuwu1@microsoft.com).

Licensing & Compatibility

  • The project is licensed under the terms found in the LICENSE file.
  • Portions of the code are based on the FAIRSEQ project.

Limitations & Caveats

  • The README does not specify exact installation or setup instructions beyond pointing to the src directory and HuggingFace integration.
  • Detailed hardware or software prerequisites for running the models are not explicitly listed.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days
