VBx by BUTSpeechFIT

Speaker diarization using variational Bayes HMM over x-vectors

Created 5 years ago
275 stars

Top 94.1% on SourcePulse

Project Summary

This repository provides VBx, a speaker diarization recipe that applies a Variational Bayes Hidden Markov Model (VB-HMM) to sequences of x-vectors. It targets researchers and practitioners who need to determine who spoke when in challenging recordings such as those in CALLHOME, AMI, and DIHARD II. Its main benefit is a Bayesian refinement step that improves the accuracy of the speaker labels produced by clustering alone.

How It Works

VBx works in three stages. First, it computes x-vectors, fixed-dimensional speaker embeddings extracted from short segments of speech. Second, it runs agglomerative hierarchical clustering (AHC) on these x-vectors to produce an initial assignment of segments to speakers. Finally, it refines that assignment with a Variational Bayes HMM over the x-vector sequence, a probabilistic model that copes with varying speaker counts more robustly than clustering alone.
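
The clustering stage can be illustrated with a small sketch. This is not the repository's code: it assumes average-linkage AHC on cosine similarity with a fixed stopping threshold, and the function name `ahc_labels`, the `threshold` default, and the toy 2-D "embeddings" are all hypothetical. Real x-vectors have hundreds of dimensions, and the VB-HMM refinement that follows is not shown.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc_labels(embeddings, threshold=0.5):
    """Average-linkage AHC; returns one cluster label per embedding.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (an illustrative value, not one taken from VBx).
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(c1, c2):
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (i, j), best = max(
            (((i, j), avg_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy example: two segments per "speaker".
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(ahc_labels(toy))
```

The pairwise search makes this sketch quadratic or worse in the number of segments, which is the kind of cost that motivates a cheaper initialization for long recordings.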

Quick Start & Requirements

  • Installation: Create a conda environment (conda create -n VBx python=3.9), activate it (conda activate VBx), clone the repository, install the package (pip install -e .), and initialize the dscore submodule (git submodule init && git submodule update).
  • Prerequisites: Python 3.9. The datasets (CALLHOME, AMI, DIHARD II) must be downloaded separately. For AMI, VAD segments and reference RTTM files are provided. Pre-trained x-vector extractors are included; the recipes for training them are available separately.
  • Example: Run ./run_example.sh.
  • Links: VBx GitHub Repository

Highlighted Details

  • Achieves competitive results on standard diarization benchmarks (CALLHOME, AMI, DIHARD II), with reported diarization error rates (DER) as low as 1.56% on AMI Mix-Headset (Forgiving protocol, dev set).
  • Offers an alternative 'random' initialization for VBx, which is faster for long recordings (>30 minutes) at a potential slight performance cost compared to AHC initialization.
  • Includes recipes for specific datasets and challenge tracks (e.g., VoxSRC-20).
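
For readers unfamiliar with the metric quoted above, DER is the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below computes a simplified frame-level version under stated assumptions: one label per frame, `None` for non-speech, no overlapping speech, and no scoring collar. The function name `frame_der` is hypothetical. It does not reproduce the scoring tool or evaluation protocols used for the reported numbers.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level DER; `None` marks a non-speech frame."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    ref_speech = sum(r is not None for r in reference)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    missed = false_alarm = both = 0
    overlap = {}  # (ref speaker, hyp speaker) -> number of shared frames
    for r, h in zip(reference, hypothesis):
        if r is not None and h is None:
            missed += 1
        elif r is None and h is not None:
            false_alarm += 1
        elif r is not None and h is not None:
            both += 1
            overlap[(r, h)] = overlap.get((r, h), 0) + 1

    ref_spk = list({r for r in reference if r is not None})
    hyp_spk = list({h for h in hypothesis if h is not None})
    # Pad so that every reference speaker can be paired with something.
    hyp_spk += [None] * max(0, len(ref_spk) - len(hyp_spk))

    # Brute-force the best one-to-one speaker mapping (fine for a few speakers).
    best = max(
        sum(overlap.get((r, h), 0) for r, h in zip(ref_spk, perm))
        for perm in permutations(hyp_spk, len(ref_spk))
    )
    confusion = both - best
    return (missed + false_alarm + confusion) / ref_speech


ref = ["A", "A", "A", "B", "B", None, "B", "B"]
hyp = ["x", "x", "y", "y", "y", "y", "y", None]
print(f"DER = {frame_der(ref, hyp):.3f}")  # 1 missed + 1 false alarm + 1 confused over 7 speech frames
```

Because hypothesis speaker names are arbitrary, the score depends on finding the best mapping between hypothesis and reference speakers; the exhaustive search here is only practical for a handful of speakers.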

Maintenance & Community

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0, which permits commercial use and modification provided that redistributions keep the license text, attribution, and copyright notices.

Limitations & Caveats

  • Agglomerative hierarchical clustering (AHC) initialization can be slow for recordings longer than 30 minutes, though a faster 'random' initialization is provided as an alternative.
  • Reported results use oracle voice activity detection (VAD), so they may overstate real-world performance, where speech boundaries must themselves be estimated.
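
The trade-off between the two initializations can be sketched as follows. This is an illustration only, assuming the refinement step starts from a per-segment matrix of speaker probabilities; the function names `one_hot_init` and `random_init` are hypothetical and do not correspond to the repository's options or code. An AHC-based start turns cluster labels into one-hot rows, while a random start skips clustering and draws each row independently.

```python
import random


def one_hot_init(labels, num_speakers):
    """Turn hard cluster labels (e.g. from AHC) into one-hot probability rows."""
    return [
        [1.0 if spk == label else 0.0 for spk in range(num_speakers)]
        for label in labels
    ]


def random_init(num_segments, num_speakers, seed=0):
    """Draw a random probability row per segment; no clustering required."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_segments):
        # Small offset keeps every weight strictly positive.
        weights = [rng.random() + 1e-9 for _ in range(num_speakers)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


print(one_hot_init([0, 0, 1], num_speakers=2))
print(random_init(num_segments=3, num_speakers=2))
```

The random start costs time proportional to the number of segments times the number of speakers, whereas clustering must compare segments with each other, which is why the former scales better to long recordings.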

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
