speechlib by NavodPeiris

Audio AI library for speaker-aware transcription

Created 2 years ago

265 stars

Top 96.3% on SourcePulse

Project Summary

Speechlib is a Python library that performs speaker diarization, speaker recognition, and transcription on audio files to generate transcripts with identified speaker names. It serves researchers and developers by providing a unified pipeline for extracting structured, speaker-attributed information from audio, simplifying analysis and content understanding.

How It Works

The library runs a multi-stage pipeline. It first preprocesses the audio: converting the input to WAV, downmixing it to a single (mono) channel, and re-encoding it as 16-bit PCM. It then uses pyannote-audio for speaker diarization and faster-whisper (or another Whisper variant, or AssemblyAI) for transcription. Speaker recognition compares each segment against the voice samples in a user-provided voices_folder and assigns the matching speaker's name to the transcribed segment.
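The mono and 16-bit PCM steps of the preprocessing stage can be illustrated with a minimal sketch that uses only the Python standard library. This is not speechlib's own code: the function name to_mono_16bit is hypothetical, the input is assumed to already be an uncompressed 8-bit or 16-bit PCM WAV file, and format conversion from other containers (MP3 and so on) is left out.

```python
import array
import sys
import wave


def to_mono_16bit(src_path: str, dst_path: str) -> None:
    """Rewrite a PCM WAV file as mono, 16-bit PCM, keeping the sample rate.

    Hypothetical helper for illustration; not part of speechlib.
    """
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if width == 1:
        # 8-bit WAV is unsigned: recentre around zero, then scale to 16 bits.
        samples = array.array("h", ((b - 128) << 8 for b in raw))
    elif width == 2:
        samples = array.array("h")
        samples.frombytes(raw)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
    else:
        raise ValueError(f"unsupported sample width: {width} bytes")

    if channels > 1:
        # Samples are interleaved per frame; average the channels of each frame.
        samples = array.array(
            "h",
            (sum(samples[i:i + channels]) // channels
             for i in range(0, len(samples), channels)),
        )

    if sys.byteorder == "big":
        samples.byteswap()  # back to little-endian before writing
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(samples.tobytes())
```

Calling to_mono_16bit("in.wav", "out.wav") on a stereo recording would write a single-channel file at the same sample rate. Averaging the channels is one simple downmix choice; the library may do this differently.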

Quick Start & Requirements

Installation: pip install speechlib
Prerequisites: Python 3.8+. GPU execution requires CUDA 11 with the cuBLAS and cuDNN 8 libraries installed. A Hugging Face access token is needed for gated models such as pyannote/speaker-diarization@2.1.
Setup: GPU setup requires installing CUDA and NVIDIA drivers. Google Colab users can install CUDA dependencies via !apt install libcublas11.
Links: The README points to the official NVIDIA documentation for CUDA installation and mentions Recall.ai as an alternative transcription API.

Highlighted Details

Combines speaker diarization, recognition, and transcription in a single workflow.
Offers audio preprocessing: format conversion, mono downmixing, and 16-bit PCM re-encoding.
Supports multiple transcription engines: faster-whisper (with optional quantization), custom Whisper models, Hugging Face models, and AssemblyAI.
GPU benchmark (6 min 36 s of audio, no quantization): with the faster-whisper "tiny" model, transcription takes ~64 s, diarization 24 s, and speaker recognition 10 s; with the "large" model, transcription takes ~343 s.
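To put the benchmark figures in context, they can be expressed as a real-time factor (processing time divided by audio duration, where values below 1 mean faster than real time). The sketch below uses only the transcription times quoted above; the helper name real_time_factor is illustrative, and diarization and recognition times are deliberately left out.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Benchmark clip length quoted in the summary: 6 min 36 s.
audio_seconds = 6 * 60 + 36  # 396 s

# Transcription-only times quoted for the two model sizes (approximate).
rtf_tiny = real_time_factor(64, audio_seconds)
rtf_large = real_time_factor(343, audio_seconds)

print(f"tiny:  {rtf_tiny:.2f}")   # prints "tiny:  0.16"
print(f"large: {rtf_large:.2f}")  # prints "large: 0.87"
```

On these numbers, the "tiny" model transcribes at roughly six times real-time speed, while the "large" model runs only slightly faster than real time.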

Maintenance & Community

No specific details regarding maintenance, community channels (e.g., Discord, Slack), or notable contributors were present in the provided README snippet.

Licensing & Compatibility

No explicit license information was found in the provided README snippet.

Limitations & Caveats

Running on Windows without administrator privileges may raise OSError: [WinError 1314].
Quantization speeds up faster-whisper but may reduce transcription accuracy.
Gated Hugging Face models require the user to accept the model's access conditions on Hugging Face and to supply an access token.
Performance benchmarks come from Google Colab tests and exclude model download times.
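WinError 1314 is the Windows "a required privilege is not held by the client" error, which commonly appears when a process tries to create a symbolic link without the needed privilege. Assuming that is the cause here (the summary does not say), a small standard-library probe can check in advance whether the current process is able to create symlinks. The function name can_create_symlinks is hypothetical and is not part of speechlib.

```python
import os
import tempfile


def can_create_symlinks() -> bool:
    """Return True if the current process can create a symbolic link.

    Hypothetical pre-flight check; assumes WinError 1314 stems from
    symlink creation, which the source does not confirm.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows without the privilege this is where WinError 1314 surfaces.
            return False
        return True


if __name__ == "__main__":
    if not can_create_symlinks():
        print("Symlinks unavailable: try an elevated (administrator) session.")
```

If the probe returns False on Windows, running the IDE or terminal as administrator is the workaround implied by the caveat above.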
