3D-Speaker by modelscope

Toolkit for speaker verification, recognition, and diarization

Created 2 years ago

2,705 stars

Top 17.4% on SourcePulse

Project Summary

3D-Speaker is an open-source toolkit addressing single- and multi-modal speaker verification, recognition, and diarization. It provides pretrained models and a large-scale dataset (3D-Speaker-Dataset) to facilitate research in speech representation disentanglement. The toolkit is beneficial for researchers and developers working on speaker-related audio processing tasks.

How It Works

The toolkit supports various speaker verification models, including Res2Net, ResNet34, ECAPA-TDNN, ERes2Net, ERes2NetV2, and CAM++. It also offers recipes for self-supervised learning approaches like SDPN and RDINO. For speaker diarization, it includes a pipeline with modules for voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering, with optional overlap detection and multimodal fusion capabilities.

Quick Start & Requirements

Installation: Clone the repository, create a conda environment (conda create -n 3D-Speaker python=3.8), activate it (conda activate 3D-Speaker), and install requirements (pip install -r requirements.txt).
Prerequisites: Python 3.8. Specific model training and inference may have additional dependencies not explicitly listed but are generally standard for deep learning audio tasks.
Running Experiments: Example commands are provided for various tasks like speaker verification (e.g., bash run.sh in egs/3dspeaker/sv-eres2netv2/) and speaker diarization (bash run_audio.sh or bash run_video.sh in egs/3dspeaker/speaker-diarization/).
Inference: Pretrained models are available on ModelScope. Inference can be performed using provided Python scripts (e.g., python speakerlab/bin/infer_sv.py --model_id $model_id).
Links:
- ModelScope pretrained models: https://www.modelscope.cn/models?page=1&tasks=speaker-verification&type=audio
- 3D-Speaker Dataset: https://3dspeaker.github.io/

Highlighted Details

Achieves state-of-the-art results on benchmarks like VoxCeleb, CNCeleb, and 3D-Speaker datasets for speaker verification, with ERes2Net-large reporting 0.52% EER on VoxCeleb1-O.
Offers competitive performance in speaker diarization, outperforming pyannote.audio on Aishell-4 with 10.30% DER.
Supports multimodal speaker diarization by fusing audio and video inputs.
Provides ONNX Runtime for efficient inference.

Maintenance & Community

The project is actively updated, with recent additions including diarization recipes, new pretrained models (ERes2NetV2, SDPN), and multimodal/semantic modules. Contact information via email is provided for inquiries.

Licensing & Compatibility

3D-Speaker is released under the Apache License 2.0. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The README does not specify hardware requirements (e.g., GPU, CUDA versions) for training or inference, which may be significant for large models. While pretrained models are available, the setup time and resource footprint for training custom models are not detailed.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

78 stars in the last 30 days