VBx by BUTSpeechFIT

Speaker diarization using variational Bayes HMM over x-vectors

Created 5 years ago

275 stars

Top 94.1% on SourcePulse

Project Summary

This repository provides VBx, a speaker diarization recipe that applies a Variational Bayes Hidden Markov Model (VB-HMM) to x-vectors. It is aimed at researchers and practitioners in speech processing who need to determine who spoke when in challenging audio datasets such as CALLHOME, AMI, and DIHARD II. Its main benefit is a Bayesian refinement step applied on top of an initial clustering-based segmentation.

How It Works

VBx works in three stages. First, it computes x-vectors, which are fixed-dimensional speaker embeddings. Second, it runs agglomerative hierarchical clustering (AHC) on these x-vectors to produce an initial speaker assignment. Finally, it refines that assignment with a Variational Bayes HMM over the x-vector sequence, a probabilistic model that can adapt to a varying number of speakers.
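
The clustering stage described above can be illustrated with a small sketch. This is not the repository's implementation: the cosine similarity measure, average linkage, the 0.5 stopping threshold, the function name ahc_init, and the two-dimensional toy "embeddings" are all assumptions chosen for readability. It only shows how bottom-up merging turns a set of embeddings into initial speaker labels that a later refinement step could start from.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc_init(embeddings, threshold=0.5):
    """Average-linkage AHC (hypothetical helper, not the VBx API).

    Repeatedly merges the two most similar clusters and stops once the
    best available similarity falls below `threshold`.
    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(c1, c2):
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best = None
        # Exhaustive pair search: quadratic in the number of clusters,
        # which is why this style of clustering slows down on long inputs.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = avg_sim(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Toy 2-D "embeddings": three point roughly along x, two roughly along y.
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = ahc_init(toy, threshold=0.5)
print(labels)
```

In this toy case the number of clusters is decided by the threshold rather than fixed in advance, so the sketch ends with two groups without being told how many speakers to expect.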

Quick Start & Requirements

Installation:
1. Create and activate a conda environment: conda create -n VBx python=3.9, then conda activate VBx.
2. Clone the repository and install the package from its root: pip install -e .
3. Initialize the dscore submodule: git submodule init && git submodule update.
Prerequisites: Python 3.9. Datasets (CALLHOME, AMI, DIHARD II) need to be downloaded separately. For AMI, VAD segments and reference rttms are provided. Pre-trained x-vector extractors are included, with training recipes available separately.
Example: Run ./run_example.sh.
Links: VBx GitHub Repository
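
Since the prerequisites mention reference RTTM files, a minimal reader for that format may be a useful starting point. The sketch below assumes the standard ten-field RTTM layout (type, file ID, channel, onset, duration, and so on, with the speaker name in the eighth field). The Turn tuple, the parse_rttm function, and the recording name meeting1 are hypothetical names for illustration, not part of the VBx code.

```python
from collections import namedtuple

# One speaker turn: which recording, when it starts, how long, and who speaks.
Turn = namedtuple("Turn", ["file_id", "onset", "duration", "speaker"])


def parse_rttm(text):
    """Parse SPEAKER lines from RTTM-formatted text into Turn tuples."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, non-SPEAKER records, and truncated lines.
        if len(fields) < 9 or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(
                file_id=fields[1],
                onset=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return turns


# Made-up example content; times are in seconds.
example = """\
SPEAKER meeting1 1 0.00 2.50 <NA> <NA> spkA <NA> <NA>
SPEAKER meeting1 1 2.50 1.25 <NA> <NA> spkB <NA> <NA>
"""
turns = parse_rttm(example)
for t in turns:
    print(t.file_id, t.speaker, t.onset, t.onset + t.duration)
```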

Highlighted Details

Achieves competitive results on standard diarization benchmarks (CALLHOME, AMI, DIHARD II), with reported DERs as low as 1.56% on AMI Mix-Headset (Forgiving protocol, dev set).
Offers an alternative 'random' initialization for VBx, which is faster for long recordings (>30 minutes) at a potential slight performance cost compared to AHC initialization.
Includes recipes for specific datasets and challenge tracks (e.g., VoxSRC-20).
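
To make the DER figures above easier to interpret, here is a simplified, frame-level sketch of diarization error rate: missed speech, false alarms, and speaker confusion, divided by the total reference speech. It is not the dscore tool used by the repository. It assumes at most one speaker per frame, applies no collar, and finds the speaker mapping by brute force, so it only suits toy inputs. The function name frame_der and the label sequences are made up for illustration.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER for single-speaker-per-frame label sequences.

    Each sequence holds one label per frame, with None meaning non-speech.
    Hypothesis speakers are mapped one-to-one onto reference speakers so
    that the number of correctly attributed frames is maximized.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of frames")
    total = sum(r is not None for r in reference)
    if total == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_spk = sorted({r for r in reference if r is not None})
    hyp_spk = sorted({h for h in hypothesis if h is not None})
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    padded = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))

    best_correct = 0
    for perm in permutations(padded, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total


ref = ["A", "A", "A", "B", "B", None, "B", "B"]
hyp = ["x", "x", "y", "y", "y", "y", None, "y"]
der = frame_der(ref, hyp)
print(f"DER = {der:.3f}")
```

In this example there are 7 reference speech frames, with one missed frame, one false-alarm frame, and one confused frame, giving a DER of 3/7. With oracle VAD and no overlapping speech, the missed and false-alarm terms would be zero and only speaker confusion would remain.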

Maintenance & Community

The project is associated with Brno University of Technology (BUT) SpeechFIT.
Contact emails for inquiries: landini@fit.vutbr.cz or mireia@fit.vutbr.cz.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This license permits commercial use, modification, and redistribution, provided that the license text, attribution, and copyright notices are preserved.

Limitations & Caveats

Agglomerative hierarchical clustering (AHC) initialization can be slow for recordings longer than 30 minutes, though a faster 'random' initialization is provided as an alternative.
Reported results use oracle voice activity detection (VAD), so they may not reflect real-world performance, where speech activity must itself be estimated.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

2 stars in the last 30 days

Explore Similar Projects

Awesome-Speaker-Diarization by DongKeon

Collection of speaker diarization papers

Created 2 years ago

Updated 3 months ago

StyleSpeech by KevinMIN95

Multi-speaker adaptive TTS generation

Created 4 years ago

Updated 3 years ago

VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

Created 4 months ago

Updated 3 months ago

EEND by hitachi-speech

Speaker diarization research paper using end-to-end neural networks

Created 6 years ago

Updated 4 years ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

UniSpeech by microsoft

Speech models for self-supervised learning

Created 4 years ago

Updated 1 year ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

fish-diffusion by fishaudio

TTS/SVS/SVC framework for voice generation tasks

Created 2 years ago

Updated 6 months ago

diart by juanmc2005

Real-time audio applications framework

Created 4 years ago

Updated 7 months ago

wespeaker by wenet-e2e

Speaker toolkit for verification, recognition, and diarization research

Created 4 years ago

Updated 1 day ago

3D-Speaker by modelscope

Toolkit for speaker verification, recognition, and diarization

Created 2 years ago

Updated 1 month ago

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

awesome-diarization by wq2012

List of resources for speaker diarization

Created 6 years ago

Updated 1 month ago

Starred by Tim J. Baek (Founder of Open WebUI), Gabriel Almeida (Cofounder of Langflow), and 1 more.

parler-tts by huggingface

TTS library for high-quality speech generation, based on a research paper

Created 1 year ago

Updated 9 months ago

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

Created 2 years ago

Updated 1 year ago
