asv-subtools by Snowdar

PyTorch/Kaldi toolkit for speaker recognition and language ID research

created 5 years ago
619 stars

Top 54.1% on sourcepulse

Project Summary

ASV-Subtools is an open-source toolkit for speaker recognition and language identification, built on PyTorch and Kaldi. It takes a modular approach, bundling tools for feature extraction, model training, and backend scoring for researchers and engineers in speech processing. The toolkit aims to provide a flexible and efficient framework for developing and experimenting with speaker recognition models.

How It Works

ASV-Subtools leverages Kaldi for acoustic feature extraction and backend scoring, while PyTorch handles flexible model building and custom training. The project is structured into three main branches: basic Kaldi-based shell scripts, Kaldi recipes for core model training (i-vectors, x-vectors), and PyTorch for custom model development. PyTorch models inherit from libs.nnet.framework.TopVirtualNnet to gain default functionality such as automatic model saving and utterance-level embedding extraction.
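Concretely, a subclass might look like the minimal sketch below. This is illustrative only: the init/forward hook names, class name, and layer sizes are assumptions based on the documented pattern, so consult the toolkit's libs/nnet/framework.py for the actual interface.

```python
# Illustrative sketch, not toolkit code: assumes TopVirtualNnet forwards
# constructor arguments to an init() hook and uses a standard forward().
import torch
import torch.nn as nn

from libs.nnet.framework import TopVirtualNnet


class ToyXvector(TopVirtualNnet):
    """Hypothetical minimal x-vector-style model built on TopVirtualNnet."""

    def init(self, inputs_dim, num_targets, embedding_dim=512):
        # Frame-level layers over (batch, feat_dim, frames) acoustic features.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(inputs_dim, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, embedding_dim, kernel_size=1),
        )
        # Statistics pooling (mean + std) gives an utterance-level vector.
        self.segment_layer = nn.Linear(2 * embedding_dim, embedding_dim)
        self.output = nn.Linear(embedding_dim, num_targets)

    def forward(self, inputs):
        x = self.frame_layers(inputs)
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        return self.output(self.segment_layer(stats))
```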

Quick Start & Requirements

  • Installation: Requires installing Kaldi first, then cloning ASV-Subtools into your project directory. Python dependencies are installed via pip install -r subtools/requirements.txt.
  • Prerequisites: Kaldi, PyTorch (>=1.10), CUDA (>=11.1 recommended for PyTorch), Python 3.8+, numpy, thop, pandas, progressbar2, matplotlib. Multi-GPU training requires NCCL, and optionally Horovod/OpenMPI; a quick environment check follows this list.
  • Setup: The README gives detailed installation instructions for both Kaldi and the Python dependencies.
  • Links: Kaldi Installation, ASV-Subtools GitHub
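Before training, a short Python check (not part of the toolkit) can confirm the PyTorch-side prerequisites are in place:

```python
# Environment sanity check for the prerequisites above; not ASV-Subtools code.
import torch

print("torch:", torch.__version__)              # expect >= 1.10
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("cuda build:", torch.version.cuda)    # >= 11.1 recommended
    print("gpus:", torch.cuda.device_count())   # multi-GPU training needs NCCL
```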

Highlighted Details

  • Supports various modern speaker embedding models including ECAPA-TDNN, ResNet, and Conformer X-vectors.
  • Offers extensive data augmentation (reverberation, noise, Mixup, SpecAugment) and advanced pooling methods such as Attentive Statistics Pooling and Multi-Head Attention (a pooling sketch follows this list).
  • Includes multiple advanced loss functions (AM-Softmax, AAM-Softmax, Ring Loss) and optimizers (Lookahead, RAdam, Ralamb).
  • Provides recipes for popular datasets like VoxCeleb, OLR Challenges, and CNSRC, with reported EERs.
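As background on one of the listed pooling methods, attentive statistics pooling (Okabe et al., 2018) replaces plain mean/std pooling with an attention-weighted mean and standard deviation over frames. The sketch below is a generic PyTorch illustration of the idea, not the toolkit's own implementation (its versions live under the libs.nnet modules):

```python
# Generic attentive statistics pooling sketch; module and argument names are
# illustrative, not taken from ASV-Subtools.
import torch
import torch.nn as nn


class AttentiveStatsPool(nn.Module):
    def __init__(self, in_dim, bottleneck_dim=128):
        super().__init__()
        # Small bottleneck network producing one attention score per frame.
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, bottleneck_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck_dim, 1, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)     # weights over frames
        mean = torch.sum(alpha * x, dim=2)                  # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-9))
        return torch.cat([mean, std], dim=1)                # (batch, 2 * channels)
```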

Maintenance & Community

  • Developed by XMU Speech Lab and TalentedSoft.
  • Maintained by Tao Jiang.
  • Community support is available via GitHub issues and email; a WeChat group can be joined via the QR code in the README.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The README notes that SpecAugment is not yet stable with multi-GPU training. Training large models on datasets such as VoxCeleb2 is time-consuming (1-2 days per model on 4 V100 GPUs), and some older scripts and results may have been removed.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
