speechlib by NavodPeiris

Audio AI library for speaker-aware transcription

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

Speechlib is a Python library that performs speaker diarization, speaker recognition, and transcription on audio files to generate transcripts with identified speaker names. It serves researchers and developers by providing a unified pipeline for extracting structured, speaker-attributed information from audio, simplifying analysis and content understanding.
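To make "speaker-attributed" concrete, the following is a minimal sketch of what such a transcript could look like once consecutive segments from the same speaker are merged into turns. The Segment record, merge_turns, and render are hypothetical names introduced for illustration; they are not speechlib's return type or API.

```python
from itertools import groupby
from typing import NamedTuple


class Segment(NamedTuple):
    """Hypothetical record for one transcribed span (not speechlib's type)."""
    start: float   # seconds
    end: float     # seconds
    speaker: str
    text: str


def merge_turns(segments):
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda s: s.speaker):
        group = list(group)
        turns.append(
            Segment(group[0].start, group[-1].end, speaker,
                    " ".join(s.text for s in group))
        )
    return turns


def render(turns):
    """Format turns as one readable line each."""
    return "\n".join(
        f"[{t.start:.1f}s-{t.end:.1f}s] {t.speaker}: {t.text}" for t in turns
    )


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.0, "Alice", "Hello there."),
        Segment(2.0, 4.0, "Alice", "How are you?"),
        Segment(4.5, 6.0, "Bob", "Fine, thanks."),
    ]
    print(render(merge_turns(demo)))
```

Grouping only adjacent segments keeps the conversation order intact, so a speaker who talks again later gets a new turn.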

How It Works

The library employs a multi-stage process starting with audio preprocessing: converting various formats to WAV, ensuring mono channel, and re-encoding to 16-bit PCM. It then utilizes pyannote-audio for speaker diarization and faster-whisper (or other Whisper variants/AssemblyAI) for transcription. Speaker recognition is performed by matching voice samples from a user-provided voices_folder to assign names to transcribed segments.
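The preprocessing stage described above can be sketched with the Python standard library alone. This is an illustrative approximation, not speechlib's own code: it assumes the input is already an uncompressed PCM WAV file with 8-bit or 16-bit samples, and the function name to_mono_16bit is hypothetical. Converting other formats to WAV would need an external tool and is left out.

```python
import array
import sys
import wave


def to_mono_16bit(src_path, dst_path):
    """Write a mono, 16-bit PCM copy of a PCM WAV file (8- or 16-bit input)."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if width == 2:
        samples = array.array("h", raw)  # signed 16-bit samples
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
    elif width == 1:
        # 8-bit WAV is unsigned: centre on zero, then scale to 16-bit range.
        samples = array.array("h", ((b - 128) << 8 for b in raw))
    else:
        raise ValueError(f"unsupported sample width: {width} bytes")

    # Downmix by averaging the interleaved channel samples of each frame.
    mono = array.array(
        "h",
        (sum(samples[i:i + channels]) // channels
         for i in range(0, len(samples), channels)),
    )
    if sys.byteorder == "big":
        mono.byteswap()  # back to little-endian before writing

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)   # mono
        dst.setsampwidth(2)   # 16-bit PCM
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Averaging the channels is one simple way to downmix; the sample rate is left unchanged because the description above does not mention resampling.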

Quick Start & Requirements

  • Installation: pip install speechlib
  • Prerequisites: Python 3.8+ and a GPU with the CUDA 11 NVIDIA libraries installed (including cuBLAS and cuDNN 8). A Hugging Face access token is needed for gated models such as pyannote/speaker-diarization@2.1.
  • Setup: GPU setup requires installing CUDA and NVIDIA drivers. Google Colab users can install CUDA dependencies via !apt install libcublas11.
  • Links: Official NVIDIA documentation for CUDA installation. Recall.ai is mentioned as an alternative transcription API.
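Before a first run it can help to check the prerequisites listed above. The sketch below is a hypothetical preflight helper, not part of speechlib. It assumes that the presence of nvidia-smi on PATH is a reasonable sign of an installed NVIDIA driver, and that the access token is kept in an environment variable named HF_TOKEN; both are assumptions of this example.

```python
import os
import shutil
import sys


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []

    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or newer is required")

    # Assumption: nvidia-smi ships with the NVIDIA driver, so its absence
    # suggests the GPU stack is not installed. It does not verify CUDA 11.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH; check the NVIDIA driver")

    # Assumption: the token lives in HF_TOKEN (a name chosen for this sketch).
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; gated models need an access token")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

The check is deliberately shallow: it cannot confirm that cuBLAS or cuDNN 8 are present, only that the obvious pieces are in place.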

Highlighted Details

  • Combines speaker diarization, recognition, and transcription in a single workflow.
  • Offers audio preprocessing: format conversion, mono channel, 16-bit PCM re-encoding.
  • Supports multiple transcription engines: faster-whisper (with optional quantization), custom Whisper models, Hugging Face models, and AssemblyAI.
  • GPU performance metrics (6m 36s audio, no quantization): faster-whisper "tiny" model transcribes in ~64s (diarization 24s, recognition 10s); "large" model in ~343s.
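One way to read the timings above is as a real-time factor (processing time divided by audio duration), where values below 1 mean faster than real time. The short calculation below uses only the figures quoted in the list and takes the ~64s and ~343s values as transcription times; the variable names are illustrative.

```python
AUDIO_SECONDS = 6 * 60 + 36  # 6m 36s = 396 seconds

# Transcription times quoted above for faster-whisper on GPU, no quantization.
TRANSCRIBE_SECONDS = {"tiny": 64, "large": 343}


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


rtf = {
    model: real_time_factor(seconds, AUDIO_SECONDS)
    for model, seconds in TRANSCRIBE_SECONDS.items()
}

for model, value in rtf.items():
    print(f"{model}: RTF {value:.2f}")
# tiny: RTF 0.16
# large: RTF 0.87
```

On these numbers the "large" model takes roughly 5.4 times as long as "tiny" (343 / 64) while still finishing in less time than the audio lasts.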

Maintenance & Community

No specific details regarding maintenance, community channels (e.g., Discord, Slack), or notable contributors were present in the provided README snippet.

Licensing & Compatibility

No explicit license information was found in the provided README snippet.

Limitations & Caveats

Running on Windows without administrator privileges may raise OSError: [WinError 1314]. Quantization speeds up faster-whisper but may reduce transcription accuracy. Gated Hugging Face models can only be used after explicitly accepting their access conditions and supplying an access token. The performance benchmarks come from Google Colab tests and exclude model download times.
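The Windows caveat can be made easier to diagnose by catching the specific error number and attaching a hint. This is a minimal sketch with hypothetical helper names (is_privilege_error, run_with_hint); it does not change what the library does, and the job argument stands in for whatever call triggers the error.

```python
WIN_PRIVILEGE_NOT_HELD = 1314  # the WinError number quoted above


def is_privilege_error(exc):
    """Return True if exc is an OSError carrying WinError 1314."""
    # OSError.winerror only exists on Windows, hence the getattr default.
    return (
        isinstance(exc, OSError)
        and getattr(exc, "winerror", None) == WIN_PRIVILEGE_NOT_HELD
    )


def run_with_hint(job):
    """Call job(); on WinError 1314, re-raise with a hint about elevation."""
    try:
        return job()
    except OSError as exc:
        if is_privilege_error(exc):
            raise PermissionError(
                "WinError 1314: try rerunning from an administrator terminal"
            ) from exc
        raise  # any other OSError is passed through unchanged
```

Chaining with "from exc" keeps the original traceback available, so the underlying failure is still visible next to the hint.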

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
