pase by santi-pdp

Speech representation learning for diverse tasks

Created 7 years ago

445 stars

Top 67.3% on SourcePulse

Project Summary

PASE (Problem Agnostic Speech Encoder) and PASE+ are self-supervised speech waveform encoders designed for feature extraction and pre-training. They are suitable for tasks like Automatic Speech Recognition (ASR), speaker recognition, emotion recognition, voice conversion, and Text-to-Speech (TTS). The primary benefit is their ability to learn robust speech representations applicable across diverse speech processing tasks without task-specific supervision.

How It Works

PASE models are trained in a self-supervised manner using a worker/minion framework. A shared encoder (PASE) turns the raw waveform into features, and several small "worker" networks each use those features to solve a different self-supervised task. The encoder and the workers are trained jointly, so the encoder has to produce features that serve every task at once. This multi-task setup pushes the encoder to capture a wide range of speech characteristics, leading to more generalizable representations. The PASE+ variant builds on this with more sophisticated data augmentation techniques and training strategies.
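The shared-encoder idea can be sketched in a few lines of PyTorch. This is a minimal illustration only: `TinyEncoder`, the two worker names (`waveform`, `energy`), their targets, and all layer sizes are assumptions made for this example, not the architecture or worker set used in the repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyEncoder(nn.Module):
    """Hypothetical stand-in encoder: waveform (B, 1, T) -> features (B, C, T // hop)."""
    def __init__(self, channels=16, hop=160):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size=hop, stride=hop)

    def forward(self, wav):
        return torch.relu(self.conv(wav))

hop = 160
encoder = TinyEncoder(channels=16, hop=hop)

# Each worker is a small head that solves its own self-supervised regression task
# from the shared encoder features (illustrative tasks, chosen for simplicity).
workers = nn.ModuleDict({
    "waveform": nn.Conv1d(16, hop, kernel_size=1),  # reconstruct the samples of each frame
    "energy": nn.Conv1d(16, 1, kernel_size=1),      # predict the log energy of each frame
})

wav = torch.randn(4, 1, 1600)                   # 4 random "utterances", 10 frames each
frames = wav.view(4, 10, hop).transpose(1, 2)   # (B, hop, n_frames)
targets = {
    "waveform": frames,
    "energy": frames.pow(2).mean(dim=1, keepdim=True).add(1e-8).log(),
}

params = list(encoder.parameters()) + list(workers.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# One joint training step: every worker reads the same features, and the summed
# loss sends gradients from all tasks back into the single encoder.
features = encoder(wav)
losses = {
    name: nn.functional.mse_loss(head(features), targets[name])
    for name, head in workers.items()
}
total_loss = sum(losses.values())
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```

Because the loss is a sum over workers, the encoder receives gradient from every task in each step, which is the mechanism the paragraph above describes.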

Quick Start & Requirements

Installation: pip install -r requirements.txt followed by python setup.py install.
Prerequisites: PyTorch 1.0+, Torchvision 0.2+. For data augmentation during training, codec2 must be built from source and pycodec2 installed (pip install pycodec2). If pycodec2 fails to load, check that LD_LIBRARY_PATH includes the directory where the codec2 library was installed. requirements.txt pins cupy-cuda100; if your CUDA version differs, you may need to change that entry to the matching cupy package.
Pre-trained Model: Downloadable checkpoint (FE_e199.ckpt) and configuration file (cfg/frontend/PASE+.cfg) are available for direct use as a PyTorch nn.Module.
Setup Time: Not explicitly stated, but self-supervised training can be resource-intensive.
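If the pycodec2 step is the one that fails, a small diagnostic like the following can show whether the import works and what LD_LIBRARY_PATH currently contains. It is a generic sketch using only the Python standard library; `check_module` and `report` are hypothetical helper names and are not part of the PASE repository.

```python
import importlib
import os

def check_module(name):
    """Try to import `name`; return (ok, message) instead of raising."""
    try:
        importlib.import_module(name)
        return True, f"{name}: import succeeded"
    except (ImportError, OSError) as exc:
        # A missing shared library usually surfaces as ImportError or OSError.
        return False, f"{name}: import failed ({exc})"

def report(modules=("pycodec2",)):
    """Collect one status line per module plus the current LD_LIBRARY_PATH."""
    lines = [check_module(m)[1] for m in modules]
    lines.append("LD_LIBRARY_PATH=" + os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    return lines

print("\n".join(report()))
```

If the import fails with a message about a shared object that cannot be opened, adding the codec2 library directory to LD_LIBRARY_PATH and re-running the check is the usual next step.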

Highlighted Details

PASE models can be directly integrated into existing PyTorch models and fine-tuned.
The framework supports custom data preparation for self-supervised training, including generating dataset configuration and training statistics files.
Extensive data augmentation options are available, including overlap, additive noise, amplitude clipping, waveform chopping, resampling, and frequency band-drop.
An example ASR experiment using the TIMIT dataset with Kaldi for HMM decoding is provided, achieving a Phoneme Error Rate (PER) of 17.2%.
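To make three of the augmentation types listed above concrete (additive noise, amplitude clipping, and waveform chopping), here is a simplified, standard-library-only sketch. The function names, parameters, and exact behavior are assumptions for illustration; the repository's own implementations may differ in how they pick noise sources, clip levels, and chopped segments.

```python
import math
import random

def add_noise(wav, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    signal_power = sum(s * s for s in wav) / len(wav)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in wav]

def clip_amplitude(wav, factor):
    """Clip samples to +/- factor * peak amplitude (0 < factor <= 1)."""
    limit = factor * max(abs(s) for s in wav)
    return [max(-limit, min(limit, s)) for s in wav]

def chop(wav, max_len, rng):
    """Zero out one randomly placed segment of 1..max_len samples."""
    length = rng.randint(1, max_len)
    start = rng.randint(0, len(wav) - length)
    return wav[:start] + [0.0] * length + wav[start + length:]

# A 0.1 s, 440 Hz sine at 16 kHz stands in for a speech waveform.
rng = random.Random(0)
wav = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]

# Distortions can be chained; each one preserves the waveform length.
augmented = chop(clip_amplitude(add_noise(wav, 20, rng), 0.5), 200, rng)
```

Keeping each distortion as a small function that maps a waveform to a waveform of the same length makes it easy to apply them in random combinations during training.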

Maintenance & Community

The repository is maintained by santi-pdp.
Citation details for both PASE and PASE+ are provided for academic use.

Licensing & Compatibility

The license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The README does not specify the exact license, which may impact commercial use.
Training PASE models from scratch requires significant data preparation and computational resources.
Some dependencies like codec2 require building from source, which can add complexity to the setup.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

