whisper-plus by kadirnar

Speech-to-text toolkit for enhanced audio processing

created 1 year ago
1,868 stars

Top 23.7% on sourcepulse

View on GitHub
Project Summary

WhisperPlus is a Python library designed to streamline and enhance audio and video processing tasks, primarily focusing on speech-to-text transcription, summarization, speaker diarization, and conversational AI. It targets developers and researchers needing efficient, multi-functional tools for multimedia content analysis.

How It Works

WhisperPlus builds on models from Hugging Face Transformers, using Whisper variants for transcription and BART for summarization. For performance, it supports FlashAttention-2 to speed up attention computation and 4-bit quantization (via BitsAndBytes and HQQ) to shrink the memory footprint and accelerate inference. The library also supports Apple's MLX framework for efficient execution on Apple Silicon.
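
As a rough illustration of the techniques named above, the sketch below uses the Hugging Face Transformers API directly (not WhisperPlus's own wrapper) to load a Whisper checkpoint with 4-bit BitsAndBytes quantization and FlashAttention-2; the model ID, audio path, and chunk length are placeholder choices.

```python
# Illustrative Transformers-level usage of the techniques described above
# (4-bit quantization + FlashAttention-2); this is not the WhisperPlus API.
import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
    pipeline,
)

model_id = "openai/whisper-large-v3"  # placeholder; other Whisper variants work too

# 4-bit weights cut VRAM use roughly 4x versus fp16, at a small accuracy cost.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # needs flash-attn and a supported GPU
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # long audio is processed in 30-second chunks
)

print(asr("audio.mp3")["text"])  # "audio.mp3" is a placeholder path
```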

Quick Start & Requirements

  • Installation: pip install whisperplus git+https://github.com/huggingface/transformers, then pip install flash-attn --no-build-isolation for FlashAttention-2 support (a minimal usage sketch follows this list).
  • Prerequisites: Python 3.x, PyTorch, and Hugging Face Hub access. Specific pipelines need additional dependencies (e.g., sentence-transformers, ctransformers, langchain for RAG; moviepy and ImageMagick for auto-captioning; pyannote models for diarization). A CUDA-capable GPU is recommended for best performance.
  • Resources: Model downloads can be substantial. Quantization and FlashAttention-2 significantly reduce VRAM requirements.
  • Docs: Hugging Face Model Hub
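
The entry points below (SpeechToTextPipeline, download_youtube_to_mp3) and their parameters are assumptions based on the upstream README and may differ between releases; treat this as a sketch of a typical quickstart rather than the definitive API.

```python
# Hypothetical quickstart; names and parameters are assumptions and may not
# match the current whisperplus release exactly.
from whisperplus import SpeechToTextPipeline, download_youtube_to_mp3

# Fetch audio from a public video (URL is a placeholder).
audio_path = download_youtube_to_mp3(
    "https://www.youtube.com/watch?v=<VIDEO_ID>",
    output_dir="downloads",
    filename="talk",
)

# Transcribe with a distilled Whisper checkpoint (model choice is illustrative).
pipeline = SpeechToTextPipeline(model_id="distil-whisper/distil-large-v3")
transcript = pipeline(audio_path=audio_path, language="english")
print(transcript)
```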

Highlighted Details

  • Supports YouTube URL to MP3 conversion.
  • Offers RAG (Retrieval-Augmented Generation) chatbots for querying video content using LanceDB or AutoLLM.
  • Includes Text-to-Speech capabilities using models like Suno Bark.
  • Provides an AutoCaption pipeline for videos.
  • Integrates speaker diarization using pyannote models.
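
For the diarization feature, a minimal sketch of the underlying pyannote.audio usage (not WhisperPlus's wrapper, which may expose a different interface) looks roughly like this; the model ID, token, and audio path are placeholders, and the gated model's terms must be accepted on Hugging Face first.

```python
# Illustrative pyannote.audio usage behind the diarization feature; the
# WhisperPlus wrapper may differ.
from pyannote.audio import Pipeline

# Requires accepting the model's terms on Hugging Face and a valid access token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)

diarization = diarizer("audio.wav")  # placeholder path

# Print who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```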

Maintenance & Community

The project is hosted on GitHub and appears to be actively maintained. Community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Speaker diarization relies on gated pyannote models, so their license terms must be accepted on Hugging Face and access authenticated with a token. Auto-captioning requires ImageMagick and specific system configuration. Some RAG pipelines require API keys for the chosen LLM provider.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 45 stars in the last 90 days

