UniSpeech by microsoft

Speech models for self-supervised learning

Created 4 years ago

476 stars

Top 64.1% on SourcePulse

1 Expert Loves This Project

patrickvonplaten

Patrick von Platen

Author of Hugging Face Diffusers; Research Engineer at Mistral

Project Summary

This repository hosts the UniSpeech family of large-scale self-supervised speech models: WavLM, UniSpeech, and UniSpeech-SAT. It provides pre-trained models and evaluation results for automatic speech recognition, speaker verification, speech separation, and speaker diarization, and is aimed at researchers and developers working on speech technology.

How It Works

The UniSpeech family learns speech representations by pre-training on large speech corpora, most of them unlabeled. WavLM is trained with a masked prediction objective: spans of input frames are hidden and the model predicts targets for the hidden positions. UniSpeech unifies supervised and self-supervised learning in a single pre-training stage so that labeled and unlabeled data are used together. UniSpeech-SAT adds speaker-aware pre-training to improve performance on speaker-related tasks. The shared goal is a universal speech representation that transfers to a wide range of downstream tasks.
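To make the masked prediction idea concrete, here is a minimal sketch of span masking over a sequence of frames. It is an illustration only, not code from this repository: the function names, the start probability, and the span length are all assumptions chosen for readability.

```python
import random


def sample_span_mask(num_frames, start_prob=0.08, span_len=10, seed=None):
    """Return a list of booleans; True marks a frame hidden from the model.

    start_prob and span_len are illustrative assumptions, not values
    taken from this repository.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        # Each frame independently becomes the start of a masked span.
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_positions(mask):
    """Indices where a masked prediction loss would be computed."""
    return [t for t, hidden in enumerate(mask) if hidden]


if __name__ == "__main__":
    mask = sample_span_mask(num_frames=100, seed=0)
    print("".join("#" if m else "." for m in mask))
    print(len(masked_positions(mask)), "of", len(mask), "frames masked")
```

During pre-training, a model would see the sequence with the masked frames replaced and be scored only on its predictions at the indices returned by masked_positions.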

Quick Start & Requirements

Models are available on Hugging Face.
Source code is located in the src directory.
The README does not list specific requirements (GPU, CUDA, Python version); expect the usual needs of large deep learning models.
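Since the models are published on Hugging Face, one plausible way to try them is through the transformers library. The sketch below is an assumption-laden example rather than the repository's documented setup: it assumes torch and transformers are installed, that the checkpoint id microsoft/wavlm-base-plus is the one you want, and that the convolutional encoder uses the default wav2vec 2.0-style kernel and stride settings. The helper names are hypothetical.

```python
# Conv feature-encoder geometry of wav2vec 2.0-style models
# (assumed here to match the default WavLM configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def expected_num_frames(num_samples):
    """Number of output frames the conv encoder yields for a raw waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def extract_features(model_id="microsoft/wavlm-base-plus", seconds=1.0):
    """Download a checkpoint and run it on random noise (needs network access)."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoFeatureExtractor, WavLMModel

    extractor = AutoFeatureExtractor.from_pretrained(model_id)
    model = WavLMModel.from_pretrained(model_id).eval()

    # Random noise stands in for real 16 kHz mono audio.
    waveform = torch.randn(int(16000 * seconds)).numpy()
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
    return hidden


if __name__ == "__main__":
    hidden = extract_features()
    print(tuple(hidden.shape), "expected frames:", expected_num_frames(16000))
```

The frame-count helper is useful for sanity-checking output shapes before wiring the features into a downstream task head.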

Highlighted Details

UniSpeech-SAT Large and WavLM Large report state-of-the-art results on speaker verification benchmarks, outperforming models such as HuBERT and wav2vec 2.0.
Both models are also competitive on speech separation and speaker diarization.
A wide array of pre-trained models is available, covering different languages and training dataset sizes.
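For context on the speaker verification results, a common way to score a verification trial is to compare two fixed-size speaker embeddings with cosine similarity and accept the pair if the score clears a threshold. The sketch below shows that scoring step in isolation. It assumes the embeddings have already been extracted by some model, and the threshold of 0.5 is a hypothetical placeholder that would normally be tuned on a development set; none of this is taken from the repository's evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


if __name__ == "__main__":
    enrolled = [0.2, 0.9, -0.1]
    probe = [0.25, 0.8, 0.0]
    print(round(cosine_similarity(enrolled, probe), 3), same_speaker(enrolled, probe))
```

Sweeping the threshold over a set of labeled trials is what produces metrics such as equal error rate, which is how verification benchmarks are typically reported.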

Maintenance & Community

Contact for issues: Submit a GitHub issue.
Contact for other communications: Yu Wu (yuwu1@microsoft.com).

Licensing & Compatibility

The project is licensed under the terms found in the LICENSE file.
Portions of the code are based on the FAIRSEQ project.

Limitations & Caveats

The README does not specify exact installation or setup instructions beyond pointing to the src directory and HuggingFace integration.
Detailed hardware or software prerequisites for running the models are not explicitly listed.

Health Check

Last commit: 1 year ago
Responsiveness: Inactive
Pull requests (30d): 0
Issues (30d): 1
Star history: 3 stars in the last 30 days
