audiobook_maker by JarodMica

Audiobook creator using open-source TTS/STS models

Created 2 years ago

516 stars

Top 60.7% on SourcePulse

Project Summary

This project provides a Windows GUI application for creating audiobooks with deep-learning text-to-speech (TTS) and speech-to-speech (S2S) models. It targets users who want to generate audiobooks with multi-speaker support, in-progress saving, and bulk editing of generated sentences.

How It Works

The application has been refactored from a monolithic design toward a Model-View-Controller (MVC) layout: view.py holds the GUI, controller.py the application logic, and model.py the functional code. This separation makes the code easier to maintain and simplifies adding new TTS and S2S engines. Each engine is configured dynamically and only needs to define a loading procedure and a generation procedure that returns an audio path to model.py.
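The engine contract described above (a loading step plus a generation step that returns an audio path) can be illustrated with a minimal sketch. This is not the project's actual code: the names Engine, register_engine, generate_audio, and the "dummy" engine are all hypothetical, and the dummy engine writes an empty placeholder file instead of real audio.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict


@dataclass
class Engine:
    """Hypothetical engine description: one loader and one generator."""
    name: str
    load: Callable[[dict], Any]                 # settings -> loaded model handle
    generate: Callable[[Any, str, Path], Path]  # (model, sentence, out_dir) -> audio path


ENGINES: Dict[str, Engine] = {}   # registry of available engines
_LOADED: Dict[str, Any] = {}      # cache of already loaded models


def register_engine(engine: Engine) -> None:
    ENGINES[engine.name] = engine


def generate_audio(engine_name: str, settings: dict, sentence: str, out_dir: Path) -> Path:
    """Model-side entry point: load the engine once, then return an audio path."""
    engine = ENGINES[engine_name]
    if engine_name not in _LOADED:
        _LOADED[engine_name] = engine.load(settings)
    return engine.generate(_LOADED[engine_name], sentence, out_dir)


# A stand-in engine (assumption: real engines would call a TTS/S2S model here).
def _dummy_load(settings: dict) -> dict:
    return {"voice": settings.get("voice", "default")}


def _dummy_generate(model: dict, sentence: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()[:8]
    out_path = out_dir / f"{model['voice']}_{digest}.wav"
    out_path.write_bytes(b"")  # placeholder, not real audio
    return out_path


register_engine(Engine("dummy", _dummy_load, _dummy_generate))

if __name__ == "__main__":
    path = generate_audio("dummy", {"voice": "narrator"}, "Hello there.", Path(tempfile.mkdtemp()))
    print(path)
```

With a registry like this, adding an engine would only mean registering one more load/generate pair; the calling code stays unchanged.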

Quick Start & Requirements

Installation: Manual installation involves cloning the repository, setting up a Python 3.11 virtual environment, installing requirements (pip install -r requirements.txt), initializing and updating submodules, and launching the controller (python src/controller.py).
Prerequisites: An NVIDIA GPU with at least 8 GB of VRAM is recommended. The CUDA 12.1 toolkit is required for RVC. Python 3.11, Git, and FFmpeg are also required. Each TTS/S2S engine (TortoiseTTS, StyleTTS 2, F5-TTS, RVC) has its own installation steps, often involving submodule updates and engine-specific pip installs.
Resources: The Torch installation is a large download. Some models, such as F5-TTS, may download additional files.
Links: CUDA Toolkit Archive, Python 3.11, Git, FFmpeg, Torch Install Check.
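As a rough pre-flight check for the prerequisites listed above, a short script along the following lines could be run before installation. It is only a sketch: the function name check_prerequisites is hypothetical, the CUDA check assumes Torch is already installed (it reports False otherwise), and it does not verify the CUDA toolkit version or available VRAM.

```python
import shutil
import sys


def check_prerequisites(required_python=(3, 11), tools=("git", "ffmpeg")):
    """Return a dict mapping each prerequisite to True/False."""
    report = {"python": sys.version_info[:2] == tuple(required_python)}

    # Git and FFmpeg must be discoverable on PATH.
    for tool in tools:
        report[tool] = shutil.which(tool) is not None

    # CUDA availability can only be queried through Torch once it is installed.
    try:
        import torch
        report["cuda"] = bool(torch.cuda.is_available())
    except (ImportError, OSError):
        report["cuda"] = False

    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```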

Highlighted Details

Supports multiple TTS engines: TortoiseTTS, StyleTTS 2, and F5-TTS, with XTTS support planned.
Integrates S2S capabilities via RVC.
Features include multi-speaker generation, audio playback during generation, saving progress, bulk sentence regeneration, and audiobook reloading/exporting.
Codebase rewritten for modularity and maintainability, allowing easier addition of new engines.
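To make the progress-saving and bulk-regeneration features more concrete, here is one way such state could be stored. The JSON layout and the names save_progress, load_progress, and sentences_to_regenerate are assumptions made for illustration; the project's actual on-disk format is not described in this summary.

```python
import json
import tempfile
from pathlib import Path


def save_progress(path: Path, sentences: list) -> None:
    """Write the per-sentence state to disk so a session can be resumed."""
    path.write_text(json.dumps(sentences, indent=2), encoding="utf-8")


def load_progress(path: Path) -> list:
    """Reload a previously saved audiobook state."""
    return json.loads(path.read_text(encoding="utf-8"))


def sentences_to_regenerate(sentences: list) -> list:
    """Indices that still need audio or were flagged for regeneration."""
    return [
        i for i, s in enumerate(sentences)
        if s.get("audio_path") is None or s.get("regenerate", False)
    ]


if __name__ == "__main__":
    state = [
        {"text": "Chapter one.", "speaker": "narrator", "audio_path": "0.wav"},
        {"text": "It was late.", "speaker": "narrator", "audio_path": "1.wav", "regenerate": True},
        {"text": "Who is there?", "speaker": "alice", "audio_path": None},
    ]
    progress_file = Path(tempfile.mkdtemp()) / "progress.json"
    save_progress(progress_file, state)
    print(sentences_to_regenerate(load_progress(progress_file)))
```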

Maintenance & Community

The project is actively maintained by JarodMica. Updates are managed via git pull and git submodule update. The README provides instructions for handling potential conflicts if local modifications have been made.
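The update procedure mentioned above could be scripted roughly as follows. This sketch runs only the two commands named in the text; the helper names update_commands and run_update are hypothetical, and any extra flags the README may recommend are not reproduced here.

```python
import subprocess


def update_commands():
    """The two update steps named above, as argument lists."""
    return [
        ["git", "pull"],
        ["git", "submodule", "update"],
    ]


def run_update(repo_dir, runner=subprocess.run):
    """Run each update step inside repo_dir, stopping at the first failure.

    `runner` is injectable so the sequence can be tested without touching Git.
    """
    for cmd in update_commands():
        runner(cmd, cwd=repo_dir, check=True)
```

With check=True, a failing git pull (for example, due to conflicting local modifications) raises subprocess.CalledProcessError before the submodule step runs.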

Licensing & Compatibility

The core application's engines are MIT or Apache-2.0 licensed. However, the pre-trained models have specific usage limitations: StyleTTS 2 requires attribution or explicit permission for synthesized voices, and F5-TTS uses a base model licensed under CC-BY-NC-4.0, which restricts commercial use.

Limitations & Caveats

The application is primarily designed for Windows. Because its GUI is not built on Gradio (a browser-based framework), it cannot be run on cloud machines and requires local hardware. The F5-TTS base model's non-commercial license restricts its use in commercial audiobook production. Torch version management can be complex, and Torch may need to be reinstalled after adding a different engine.
