whisper-plus by kadirnar

Speech-to-text toolkit for enhanced audio processing

created 1 year ago
1,868 stars

Top 23.7% on sourcepulse

View on GitHub
Project Summary

WhisperPlus is a Python library designed to streamline and enhance audio and video processing tasks, primarily focusing on speech-to-text transcription, summarization, speaker diarization, and conversational AI. It targets developers and researchers needing efficient, multi-functional tools for multimedia content analysis.

How It Works

WhisperPlus builds on models from Hugging Face Transformers, using Whisper variants for transcription and BART for summarization. For performance, it supports FlashAttention-2 to speed up attention computation and 4-bit quantization (via BitsAndBytes and HQQ) to shrink the memory footprint and accelerate inference. The library also supports Apple's MLX framework for efficient execution on Apple Silicon.
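
As a rough illustration of the techniques named above, the sketch below uses the Hugging Face Transformers API directly (not WhisperPlus's own wrapper) to load a Whisper checkpoint with 4-bit BitsAndBytes quantization and FlashAttention-2; the model ID, audio path, and chunk length are placeholder choices.

```python
# Illustrative Transformers-level usage of the techniques described above
# (4-bit quantization + FlashAttention-2); this is not the WhisperPlus API.
import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
    pipeline,
)

model_id = "openai/whisper-large-v3"  # placeholder; other Whisper variants work too

# 4-bit weights cut VRAM use roughly 4x versus fp16, at a small accuracy cost.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # needs flash-attn and a supported GPU
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # long audio is processed in 30-second chunks
)

print(asr("audio.mp3")["text"])  # "audio.mp3" is a placeholder path
```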

Quick Start & Requirements

  • Installation: pip install whisperplus git+https://github.com/huggingface/transformers, then pip install flash-attn --no-build-isolation for FlashAttention-2 support (a minimal usage sketch follows this list).
  • Prerequisites: Python 3.x, PyTorch, and Hugging Face Hub access. Specific pipelines need additional dependencies (e.g., sentence-transformers, ctransformers, langchain for RAG; moviepy and ImageMagick for auto-captioning; pyannote models for diarization). A CUDA-capable GPU is recommended for best performance.
  • Resources: Model downloads can be substantial. Quantization and FlashAttention-2 significantly reduce VRAM requirements.
  • Docs: Hugging Face Model Hub
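
The entry points below (SpeechToTextPipeline, download_youtube_to_mp3) and their parameters are assumptions based on the upstream README and may differ between releases; treat this as a sketch of a typical quickstart rather than the definitive API.

```python
# Hypothetical quickstart; names and parameters are assumptions and may not
# match the current whisperplus release exactly.
from whisperplus import SpeechToTextPipeline, download_youtube_to_mp3

# Fetch audio from a public video (URL is a placeholder).
audio_path = download_youtube_to_mp3(
    "https://www.youtube.com/watch?v=<VIDEO_ID>",
    output_dir="downloads",
    filename="talk",
)

# Transcribe with a distilled Whisper checkpoint (model choice is illustrative).
pipeline = SpeechToTextPipeline(model_id="distil-whisper/distil-large-v3")
transcript = pipeline(audio_path=audio_path, language="english")
print(transcript)
```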

Highlighted Details

  • Supports YouTube URL to MP3 conversion.
  • Offers RAG (Retrieval-Augmented Generation) chatbots for querying video content using LanceDB or AutoLLM.
  • Includes Text-to-Speech capabilities using models like Suno Bark.
  • Provides an AutoCaption pipeline for videos.
  • Integrates speaker diarization using pyannote models.
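
For the diarization feature, a minimal sketch of the underlying pyannote.audio usage (not WhisperPlus's wrapper, which may expose a different interface) looks roughly like this; the model ID, token, and audio path are placeholders, and the gated model's terms must be accepted on Hugging Face first.

```python
# Illustrative pyannote.audio usage behind the diarization feature; the
# WhisperPlus wrapper may differ.
from pyannote.audio import Pipeline

# Requires accepting the model's terms on Hugging Face and a valid access token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)

diarization = diarizer("audio.wav")  # placeholder path

# Print who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```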

Maintenance & Community

The project is hosted on GitHub and appears to be actively maintained. Community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Speaker diarization relies on gated pyannote models, so their license terms must be accepted on Hugging Face and access authenticated with a token. Auto-captioning requires ImageMagick and specific system configuration. Some RAG pipelines require API keys for the chosen LLM provider.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 45 stars in the last 90 days

