mazinger by bakrianoo

AI-powered video dubbing pipeline

Created 1 month ago
399 stars

Top 72.4% on SourcePulse

Project Summary

Mazinger is a video dubbing pipeline that automates the entire process, from downloading a video to producing a fully dubbed audio or video file. It targets users who need efficient video localization, offering a single-command workflow with voice cloning and subtitle generation that significantly reduces manual effort.

How It Works

Mazinger orchestrates ten distinct stages—download, transcribe, thumbnails, describe, review, translate, re-segment, speak, assemble, and subtitle—into a cohesive pipeline. This modular architecture allows for independent stage execution and automatic resumption of interrupted workflows through caching. It integrates multiple ASR (Whisper, faster-whisper, WhisperX) and TTS (Qwen3-TTS, Chatterbox) models, optionally enhanced by LLMs for tasks like summarization and transcription refinement, providing flexibility and robust performance.

Quick Start & Requirements

  • Primary install: pip install mazinger (base functionality). Optional extras for local transcription ([transcribe-faster], [transcribe-whisperx]) and TTS ([tts], [tts-chatterbox]) are available.
  • Prerequisites: Python 3.10+, ffmpeg installed and in PATH, OpenAI API key (for LLM stages), CUDA GPU (recommended for local transcription/TTS).
  • Documentation: Comprehensive guides are available in the docs/ directory, covering installation, quick start, pipeline details, CLI/API references, voice profiles, and configuration.

Highlighted Details

  • End-to-end pipeline with ten modular, independently runnable stages.
  • Automatic resumption of interrupted runs with caching for completed stages.
  • Flexible voice cloning options: custom audio samples, HuggingFace profiles, 16 pre-defined voice themes, or automatic cloning from source audio.
  • Support for multiple ASR and TTS backends, including WhisperX for word-level alignment and Qwen3-TTS/Chatterbox for speech synthesis.
  • Optional LLM integration for advanced features like video summarization and ASR refinement.
  • Subtitle embedding into video output with customizable styling (fonts, size).
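Subtitle embedding with font styling is commonly done through ffmpeg's `subtitles` filter and its `force_style` override. A minimal Python sketch of building such a command follows; it is illustrative only, since Mazinger's actual flags and styling options are not documented here, and the `build_subtitle_cmd` helper is hypothetical.

```python
def build_subtitle_cmd(video, srt, output, font="Arial", size=24):
    """Build an ffmpeg command that burns styled subtitles into a video,
    using the `subtitles` filter with `force_style` (libass ASS fields)."""
    style = f"FontName={font},FontSize={size}"
    vf = f"subtitles={srt}:force_style='{style}'"
    # Copy the audio stream unchanged; only the video is re-encoded.
    return ["ffmpeg", "-y", "-i", video, "-vf", vf, "-c:a", "copy", output]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once ffmpeg is on PATH, which the prerequisites above already require.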

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap were found in the provided README.

Licensing & Compatibility

The project is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Dependency conflicts exist between certain TTS backends (Qwen/Chatterbox) and WhisperX, necessitating careful environment management as detailed in the documentation. A CUDA-enabled GPU is recommended for optimal performance of local transcription and TTS tasks. An OpenAI API key is required for stages leveraging LLM capabilities.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 3
  • Star History: 400 stars in the last 30 days
