senko by narcotic-sh

Speaker diarization pipeline for rapid, accurate audio analysis

Created 8 months ago
256 stars

Top 98.6% on SourcePulse

Project Summary

Senko is a high-performance speaker diarization pipeline designed for speed and accuracy, processing an hour of audio in seconds on modern hardware. It targets engineers and researchers needing efficient audio segmentation, offering significant speedups over traditional methods and powering applications like the Zanshin media player.

How It Works

Senko optimizes the 3D-Speaker diarization pipeline with several modifications for speed and efficiency. Voice activity detection uses either Pyannote segmentation-3.0 or Silero VAD. Feature extraction computes Fbank features upfront, accelerated on GPU via kaldifeat on NVIDIA hardware or parallelized across all CPU cores otherwise. Speaker embeddings are generated with batched inference of the CAM++ model. Clustering is GPU-accelerated via RAPIDS on compatible NVIDIA hardware (CUDA compute capability 7.0+), or uses UMAP+HDBSCAN otherwise. On macOS, models run through CoreML for hardware acceleration on Apple Silicon.
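To give a feel for the clustering stage, here is a minimal, library-free sketch that groups unit-direction speaker embeddings by cosine similarity to a running cluster centroid. This is a simplified stand-in for the pipeline's actual RAPIDS or UMAP+HDBSCAN clustering; the function names and the similarity threshold are illustrative, not Senko's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering: assign each embedding to the first cluster whose
    centroid is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (centroid, member_indices) pairs
    labels = []
    for i, emb in enumerate(embeddings):
        for label, (centroid, members) in enumerate(clusters):
            if cosine(emb, centroid) >= threshold:
                members.append(i)
                n = len(members)
                # Update the centroid as a running mean of its members
                centroid[:] = [(c * (n - 1) + e) / n for c, e in zip(centroid, emb)]
                labels.append(label)
                break
        else:
            clusters.append((list(emb), [i]))
            labels.append(len(clusters) - 1)
    return labels

# Two synthetic "speakers": embeddings near (1, 0) and near (0, 1)
embs = [(1.0, 0.05), (0.98, 0.1), (0.05, 1.0), (0.1, 0.97)]
print(cluster_embeddings(embs))  # → [0, 0, 1, 1]
```

Real diarization embeddings are high-dimensional (CAM++ outputs hundreds of dimensions) and density-based methods like HDBSCAN avoid having to pick a fixed threshold, but the grouping intuition is the same.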

Quick Start & Requirements

Installation involves creating a Python 3.13 virtual environment using uv and then installing via pip:

  • NVIDIA GPUs (compute capability >= 7.5): uv pip install "git+https://github.com/narcotic-sh/senko.git[nvidia]"
  • NVIDIA GPUs (compute capability < 7.5): uv pip install "git+https://github.com/narcotic-sh/senko.git[nvidia-old]"
  • Mac (macOS 14+) / CPU: uv pip install "git+https://github.com/narcotic-sh/senko.git"

Prerequisites include gcc/clang (Linux/WSL) or the Xcode Command Line Tools (macOS). NVIDIA installations require CUDA 12-capable drivers. See examples/diarize.py for usage examples and DOCS.md for detailed documentation.
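Put together, a typical install on an NVIDIA machine might look like the following. The environment path and the choice of extra are illustrative (pick the extra matching your hardware from the bullets above), and running the bundled example assumes a local checkout of the repository:

```shell
# Create a Python 3.13 virtual environment with uv and activate it
uv venv --python 3.13
source .venv/bin/activate

# Install Senko with the extra matching your hardware
# (here: NVIDIA GPU with compute capability >= 7.5)
uv pip install "git+https://github.com/narcotic-sh/senko.git[nvidia]"

# Try the bundled example (see DOCS.md for the full API)
python examples/diarize.py
```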

Highlighted Details

  • Performance: Achieves exceptional speed, processing 1 hour of audio in approximately 5 seconds on an RTX 4090 paired with a Ryzen 9 7950X, and 7.7 seconds on an Apple M3.
  • Accuracy: Reports strong benchmark results, with best DERs of 13.5% on VoxConverse, 13.3% on AISHELL-4, and 26.5% on AMI-IHM.
  • Integration: Serves as the core engine for the Zanshin media player and is integrated into other applications like reaper_speech_diarizer, scribe (for speaker-attributed transcripts), and verbatim (for multilingual speech-to-text).
  • Hardware Acceleration: Leverages GPU acceleration via RAPIDS for clustering on compatible NVIDIA hardware and utilizes CoreML on Apple Silicon for VAD and embeddings, optimizing performance across platforms.
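The throughput figures above translate directly into real-time factors; this is simple arithmetic on the numbers reported, not an additional benchmark:

```python
def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds

HOUR = 3600.0
print(round(real_time_factor(HOUR, 5.0)))  # RTX 4090 + Ryzen 9 7950X → 720
print(round(real_time_factor(HOUR, 7.7)))  # Apple M3 → 468
```

In other words, the reported figures correspond to roughly 720x and 468x faster than real time, respectively.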

Maintenance & Community

A Discord server is available for community support, feature suggestions, and development discussions. Specific details on core contributors or sponsorships are not provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license. Compatibility is noted for Linux, macOS, and WSL. Native Windows installation details are in WINDOWS.md and may have specific limitations.

Limitations & Caveats

Performance is sensitive to audio recording quality; background noise or low fidelity degrades accuracy. Highly similar voices may be misclassified, and distinct recording conditions for the same speaker can lead to multiple speaker detections. The pipeline currently does not output overlapping speaker segments, though this is a planned feature.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
12 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.

voxtral.c by antirez

Top 0.6% on SourcePulse
2k stars

Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
Created 2 months ago
Updated 2 months ago