asv-subtools by Snowdar

PyTorch/Kaldi toolkit for speaker recognition and language ID research

created 5 years ago
619 stars

Top 54.1% on sourcepulse

Project Summary

ASV-Subtools is an open-source toolkit for speaker recognition and language identification, built on PyTorch and Kaldi. It takes a modular approach, bundling tools for feature extraction, model training, and backend scoring for researchers and engineers in speech processing. The toolkit aims to provide a flexible and efficient framework for developing and experimenting with speaker recognition models.

How It Works

ASV-Subtools leverages Kaldi for acoustic feature extraction and backend scoring, while PyTorch handles flexible model building and custom training. The project is structured into three main branches: basic Kaldi-based shell scripts, Kaldi recipes for core model training (i-vectors, x-vectors), and PyTorch for custom model development. PyTorch models inherit from libs.nnet.framework.TopVirtualNnet to gain default functionality such as automatic model saving and utterance-level embedding extraction.
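Concretely, a subclass might look like the minimal sketch below. This is illustrative only: the init/forward hook names, class name, and layer sizes are assumptions based on the documented pattern, so consult the toolkit's libs/nnet/framework.py for the actual interface.

```python
# Illustrative sketch, not toolkit code: assumes TopVirtualNnet forwards
# constructor arguments to an init() hook and uses a standard forward().
import torch
import torch.nn as nn

from libs.nnet.framework import TopVirtualNnet


class ToyXvector(TopVirtualNnet):
    """Hypothetical minimal x-vector-style model built on TopVirtualNnet."""

    def init(self, inputs_dim, num_targets, embedding_dim=512):
        # Frame-level layers over (batch, feat_dim, frames) acoustic features.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(inputs_dim, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, embedding_dim, kernel_size=1),
        )
        # Statistics pooling (mean + std) gives an utterance-level vector.
        self.segment_layer = nn.Linear(2 * embedding_dim, embedding_dim)
        self.output = nn.Linear(embedding_dim, num_targets)

    def forward(self, inputs):
        x = self.frame_layers(inputs)
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        return self.output(self.segment_layer(stats))
```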

Quick Start & Requirements

  • Installation: Requires installing Kaldi first, then cloning ASV-Subtools into your project directory. Python dependencies are installed via pip install -r subtools/requirements.txt.
  • Prerequisites: Kaldi, PyTorch (>=1.10), CUDA (>=11.1 recommended for PyTorch), Python 3.8+, numpy, thop, pandas, progressbar2, matplotlib. Multi-GPU training requires NCCL, and optionally Horovod/OpenMPI; a quick environment check follows this list.
  • Setup: The README gives detailed installation instructions for both Kaldi and the Python dependencies.
  • Links: Kaldi Installation, ASV-Subtools GitHub
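Before training, a short Python check (not part of the toolkit) can confirm the PyTorch-side prerequisites are in place:

```python
# Environment sanity check for the prerequisites above; not ASV-Subtools code.
import torch

print("torch:", torch.__version__)              # expect >= 1.10
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("cuda build:", torch.version.cuda)    # >= 11.1 recommended
    print("gpus:", torch.cuda.device_count())   # multi-GPU training needs NCCL
```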

Highlighted Details

  • Supports various modern speaker embedding models including ECAPA-TDNN, ResNet, and Conformer X-vectors.
  • Offers extensive data augmentation (reverberation, noise, Mixup, SpecAugment) and advanced pooling methods such as Attentive Statistics Pooling and Multi-Head Attention (a pooling sketch follows this list).
  • Includes multiple advanced loss functions (AM-Softmax, AAM-Softmax, Ring Loss) and optimizers (Lookahead, RAdam, Ralamb).
  • Provides recipes for popular datasets like VoxCeleb, OLR Challenges, and CNSRC, with reported EERs.
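As background on one of the listed pooling methods, attentive statistics pooling (Okabe et al., 2018) replaces plain mean/std pooling with an attention-weighted mean and standard deviation over frames. The sketch below is a generic PyTorch illustration of the idea, not the toolkit's own implementation (its versions live under the libs.nnet modules):

```python
# Generic attentive statistics pooling sketch; module and argument names are
# illustrative, not taken from ASV-Subtools.
import torch
import torch.nn as nn


class AttentiveStatsPool(nn.Module):
    def __init__(self, in_dim, bottleneck_dim=128):
        super().__init__()
        # Small bottleneck network producing one attention score per frame.
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, bottleneck_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck_dim, 1, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)     # weights over frames
        mean = torch.sum(alpha * x, dim=2)                  # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-9))
        return torch.cat([mean, std], dim=1)                # (batch, 2 * channels)
```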

Maintenance & Community

  • Developed by XMU Speech Lab and TalentedSoft.
  • Maintained by Tao Jiang.
  • Community support is available via GitHub issues and email; a WeChat group can be joined via the QR code in the README.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The README notes that SpecAugment is not yet stable with multi-GPU training. Training large models on datasets such as VoxCeleb2 is time-consuming (1-2 days per model on 4 V100 GPUs), and some older scripts and results may have been removed.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
