pase by santi-pdp

Speech representation learning for diverse tasks

Created 6 years ago
444 stars

Top 67.6% on SourcePulse

Project Summary

PASE (Problem Agnostic Speech Encoder) and PASE+ are self-supervised speech waveform encoders designed for feature extraction and pre-training. They are suitable for tasks like Automatic Speech Recognition (ASR), speaker recognition, emotion recognition, voice conversion, and Text-to-Speech (TTS). The primary benefit is their ability to learn robust speech representations applicable across diverse speech processing tasks without task-specific supervision.

How It Works

PASE models are trained in a self-supervised manner using a worker/minion framework. A single encoder (PASE) turns the raw waveform into a representation that is fed to several small "worker" networks, each solving a different self-supervised task. The encoder and the workers are trained jointly, so the encoder has to produce features that serve every task at once. This multi-task setup lets the encoder capture a wide range of speech characteristics and yields more generalizable representations. The PASE+ variant builds on this with more extensive data augmentation and refined training strategies.
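The shared-encoder, multi-worker idea can be illustrated with a dependency-free toy. This is only a sketch under stated assumptions: the one-parameter `toy_encoder`, the two workers, and the choice to average the worker losses are all hypothetical stand-ins, not the networks or objectives used in the PASE repository, and no gradient step is shown.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

def split_frames(signal: Sequence[float], hop: int) -> List[Sequence[float]]:
    """Cut the waveform into non-overlapping frames of `hop` samples."""
    return [signal[i:i + hop] for i in range(0, len(signal) - hop + 1, hop)]

def toy_encoder(signal: Sequence[float], hop: int, weight: float) -> List[float]:
    """Hypothetical stand-in for the encoder: one scalar feature per frame."""
    return [weight * fmean(frame) for frame in split_frames(signal, hop)]

def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return fmean([(p - t) ** 2 for p, t in zip(pred, target)])

def mean_worker(features: List[float], signal: Sequence[float], hop: int) -> float:
    """Toy worker 1: regress the per-frame mean, a target derived from the signal itself."""
    target = [fmean(frame) for frame in split_frames(signal, hop)]
    return mse(features, target)

def energy_worker(features: List[float], signal: Sequence[float], hop: int) -> float:
    """Toy worker 2: regress the per-frame energy from the squared feature."""
    target = [fmean([x * x for x in frame]) for frame in split_frames(signal, hop)]
    return mse([z * z for z in features], target)

# Hypothetical worker registry; every worker sees the same encoder output.
WORKERS: Dict[str, Callable[[List[float], Sequence[float], int], float]] = {
    "frame_mean": mean_worker,
    "frame_energy": energy_worker,
}

def multi_task_loss(signal: Sequence[float], hop: int = 4, weight: float = 1.0) -> Dict[str, float]:
    """Encode once, score with every worker, and average the worker losses."""
    features = toy_encoder(signal, hop, weight)  # computed once, shared by all workers
    losses = {name: worker(features, signal, hop) for name, worker in WORKERS.items()}
    losses["total"] = fmean(list(losses.values()))  # assumption: plain average
    return losses

if __name__ == "__main__":
    print(multi_task_loss([0.5] * 8, hop=4, weight=0.0))
```

In a real training loop the total loss would be back-propagated through both the workers and the shared encoder, which is what pushes the encoder toward features useful for all tasks.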

Quick Start & Requirements

  • Installation: pip install -r requirements.txt followed by python setup.py install.
  • Prerequisites: PyTorch 1.0+, Torchvision 0.2+. Data augmentation during training additionally requires codec2 built from source and pycodec2 (pip install pycodec2). If pycodec2 fails to load, check that LD_LIBRARY_PATH includes the directory containing the codec2 shared library. requirements.txt pins cupy-cuda100, the CuPy build for CUDA 10.0; edit that entry if your CUDA version differs.
  • Pre-trained Model: Downloadable checkpoint (FE_e199.ckpt) and configuration file (cfg/frontend/PASE+.cfg) are available for direct use as a PyTorch nn.Module.
  • Setup Time: Not explicitly stated, but self-supervised training can be resource-intensive.
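
When pycodec2 fails to load, a small diagnostic like the one below can show whether the dynamic linker is able to locate the codec2 library and which directories LD_LIBRARY_PATH currently contributes. This is a sketch, not part of the PASE repository: the function name `codec2_diagnostics` is hypothetical, and the library name `codec2` is an assumption based on the usual `libcodec2` shared-object naming.

```python
import ctypes.util
import os
from typing import Dict, List, Optional, Union

def codec2_diagnostics(lib_name: str = "codec2") -> Dict[str, Union[Optional[str], List[str]]]:
    """Report whether the shared library can be located and what LD_LIBRARY_PATH holds."""
    raw_path = os.environ.get("LD_LIBRARY_PATH", "")
    search_dirs = [d for d in raw_path.split(os.pathsep) if d]
    # find_library returns a library name/path string, or None when nothing is found.
    found = ctypes.util.find_library(lib_name)
    return {"library": found, "ld_library_path": search_dirs}

if __name__ == "__main__":
    info = codec2_diagnostics()
    if info["library"] is None:
        print("codec2 not found; searched LD_LIBRARY_PATH entries:", info["ld_library_path"])
    else:
        print("codec2 resolved as:", info["library"])
```

If the library is reported as missing, adding the directory that contains the built codec2 shared object to LD_LIBRARY_PATH before starting Python is the usual remedy.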

Highlighted Details

  • PASE models can be directly integrated into existing PyTorch models and fine-tuned.
  • The framework supports custom data preparation for self-supervised training, including generating dataset configuration and training statistics files.
  • Extensive data augmentation options are available, including overlapped speech, additive noise, amplitude clipping, waveform chopping, resampling, and frequency band-drop.
  • An example ASR experiment using the TIMIT dataset with Kaldi for HMM decoding is provided, achieving a Phoneme Error Rate (PER) of 17.2%.
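
Phoneme Error Rate is conventionally the Levenshtein (edit) distance between the reference and hypothesized phoneme sequences, divided by the reference length. The sketch below computes it for a single made-up utterance; the phoneme strings are illustrative and not taken from TIMIT, and toolkit scoring normally aggregates errors over the whole test set rather than averaging per utterance.

```python
from typing import Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference symbols
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER in percent for one reference/hypothesis pair."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    reference = "sil dh ax k ae t sil".split()
    hypothesis = "sil dh ax k ah t".split()  # one substitution, one deletion
    print(f"PER = {phoneme_error_rate(reference, hypothesis):.1f}%")
```

With 2 errors against 7 reference phonemes, this toy pair gives a PER of about 28.6%.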

Maintenance & Community

  • The repository is maintained by santi-pdp.
  • Citation details for both PASE and PASE+ are provided for academic use.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

  • The README does not specify the exact license, which may impact commercial use.
  • Training PASE models from scratch requires significant data preparation and computational resources.
  • Some dependencies like codec2 require building from source, which can add complexity to the setup.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
