Speech-to-text toolkit for enhanced audio processing
Top 23.7% on sourcepulse
WhisperPlus is a Python library designed to streamline and enhance audio and video processing tasks, primarily focusing on speech-to-text transcription, summarization, speaker diarization, and conversational AI. It targets developers and researchers needing efficient, multi-functional tools for multimedia content analysis.
How It Works
WhisperPlus leverages state-of-the-art models from Hugging Face Transformers, including Whisper variants for transcription and BART for summarization. It integrates advanced techniques like FlashAttention-2 for faster attention computation and quantization (4-bit via BitsAndBytes and HQQ) for reduced memory footprint and improved inference speed. The library also supports Apple's MLX framework for efficient execution on Apple Silicon.
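The combination described above can be sketched directly with Hugging Face Transformers. This is an illustrative assumption of how the pieces fit together, not WhisperPlus's own API; the model name, audio path, and option choices are hypothetical.

```python
# Sketch of Whisper transcription with FlashAttention-2 and 4-bit quantization,
# built on the Hugging Face Transformers pipeline API. Model name and options
# are illustrative assumptions, not WhisperPlus's own interface.

def select_attn_implementation(flash_attn_available: bool) -> str:
    """Prefer FlashAttention-2 when the flash-attn package is installed,
    otherwise fall back to PyTorch's scaled-dot-product attention."""
    return "flash_attention_2" if flash_attn_available else "sdpa"


def build_asr_pipeline(model_name: str = "openai/whisper-large-v3"):
    """Construct a transcription pipeline (hypothetical configuration).
    Requires a CUDA GPU for bitsandbytes 4-bit loading."""
    import torch
    from transformers import BitsAndBytesConfig, pipeline

    try:
        import flash_attn  # noqa: F401  -- optional dependency
        attn = select_attn_implementation(True)
    except ImportError:
        attn = select_attn_implementation(False)

    quant = BitsAndBytesConfig(
        load_in_4bit=True,                     # 4-bit weights, smaller memory footprint
        bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
    )
    return pipeline(
        "automatic-speech-recognition",
        model=model_name,
        device_map="auto",
        model_kwargs={"attn_implementation": attn, "quantization_config": quant},
    )

# Example usage (downloads the model on first run):
# asr = build_asr_pipeline()
# print(asr("audio.mp3")["text"])
```

The fallback to "sdpa" keeps the sketch usable on machines where flash-attn is not installed, since FlashAttention-2 only builds against CUDA.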
Quick Start & Requirements
Install via pip:

pip install whisperplus git+https://github.com/huggingface/transformers
pip install flash-attn --no-build-isolation

Optional dependencies are required for some features (sentence-transformers, ctransformers, and langchain for RAG; moviepy and imagemagick for auto-captioning; pyannote models for diarization). A GPU with CUDA is recommended for optimal performance.

Highlighted Details
Speaker diarization is built on pyannote models.

Maintenance & Community
The project is hosted on GitHub and appears to be actively maintained. Community engagement channels are not explicitly listed in the README.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
Speaker diarization requires explicit confirmation of licensing permissions for the pyannote models. Auto-captioning requires imagemagick and specific system configurations. Some RAG pipelines require API keys for certain LLM providers.