audiobook_maker  by JarodMica

Audiobook creator using open-source TTS/STS models

Created 2 years ago
486 stars

Top 63.3% on SourcePulse

Project Summary

This project provides a Windows GUI application for creating audiobooks with deep learning text-to-speech (TTS) and speech-to-speech (S2S) models. It targets users who want to generate high-quality audiobooks and offers multi-speaker support, in-progress saving, and bulk editing so that long books can be produced and corrected in a single workflow.

How It Works

The application was rewritten from a monolithic design into a modular one that approximates Model-View-Controller (MVC). The code is separated into view.py (GUI), controller.py (logic), and model.py (functional code), which improves maintainability and makes it easier to integrate new TTS and S2S engines. Each engine is configured dynamically: it only needs to define a loading procedure and a generation procedure that returns an audio path to model.py.
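The engine contract described above (a loading procedure plus a generation procedure that returns an audio path) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Engine`, `ENGINES`, `register_engine`, `generate_audio`, and the dummy engine are all hypothetical names chosen for the example.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


# All names here are hypothetical; the real model.py may be organized differently.
@dataclass
class Engine:
    name: str
    load: Callable[[], object]                      # loading procedure: returns a model handle
    generate: Callable[[object, str, Path], Path]   # generation procedure: returns an audio path


ENGINES: Dict[str, Engine] = {}


def register_engine(engine: Engine) -> None:
    """Make an engine available under its name."""
    ENGINES[engine.name] = engine


def generate_audio(engine_name: str, sentence: str, out_dir: Path) -> Path:
    """Load the chosen engine and return the path of the audio it generates."""
    engine = ENGINES[engine_name]
    model = engine.load()
    return engine.generate(model, sentence, out_dir)


# A stand-in engine that writes an empty placeholder file instead of real audio.
def _load_dummy() -> object:
    return "dummy-model"


def _generate_dummy(model: object, sentence: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "sentence_0.wav"
    path.write_bytes(b"")  # placeholder bytes, not a valid WAV file
    return path


register_engine(Engine("dummy", _load_dummy, _generate_dummy))

if __name__ == "__main__":
    print(generate_audio("dummy", "Hello, world.", Path(tempfile.mkdtemp())))
```

Under this assumed design, adding an engine means writing two functions and registering them; the calling code only ever sees the returned path.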

Quick Start & Requirements

  • Installation: Manual installation involves cloning the repository, setting up a Python 3.11 virtual environment, installing requirements (pip install -r requirements.txt), initializing and updating submodules, and launching the controller (python src/controller.py).
  • Prerequisites: An NVIDIA GPU with at least 8GB VRAM is recommended. The CUDA 12.1 toolkit is required for RVC. Python 3.11, Git, and FFmpeg are also required. Each TTS/S2S engine (TortoiseTTS, StyleTTS 2, F5-TTS, RVC) has its own installation steps, often involving submodule updates and specific pip installs.
  • Resources: The Torch installation is a large download. Some models, such as F5-TTS, may trigger additional downloads.
  • Links: CUDA Toolkit Archive, Python 3.11, Git, FFmpeg, Torch Install Check.
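Before starting the manual installation, the prerequisites listed above can be checked with a short script. This is a hedged sketch using only the Python standard library; the function names are made up for the example, and it does not check the GPU, VRAM, or CUDA toolkit.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 11)          # the summary calls for a Python 3.11 environment
REQUIRED_TOOLS = ("git", "ffmpeg")  # command-line tools expected on PATH


def python_ok(version_info=sys.version_info) -> bool:
    """Return True when the interpreter's major.minor version is 3.11."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Expected Python 3.11, found", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```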

Highlighted Details

  • Supports multiple TTS engines: TortoiseTTS, StyleTTS 2, and F5-TTS, with XTTS planned.
  • Integrates S2S capabilities via RVC.
  • Features include multi-speaker generation, audio playback during generation, saving progress, bulk sentence regeneration, and audiobook reloading/exporting.
  • Codebase rewritten for modularity and maintainability, allowing easier addition of new engines.

Maintenance & Community

The project is actively maintained by JarodMica. Updates are managed via git pull and git submodule update. The README provides instructions for handling potential conflicts if local modifications have been made.
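The update routine can be scripted along these lines. This is only a sketch: the `--init --recursive` flags are an assumption rather than something taken from the README, `update_repo` is a hypothetical helper, and conflicts from local modifications still have to be resolved by hand.

```python
import subprocess

# Commands run in order. The submodule flags are an assumption;
# check the project's README for the exact invocation.
UPDATE_COMMANDS = [
    ["git", "pull"],
    ["git", "submodule", "update", "--init", "--recursive"],
]


def update_repo(repo_dir, run=subprocess.run) -> None:
    """Run each update command inside repo_dir, stopping at the first failure."""
    for cmd in UPDATE_COMMANDS:
        run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    update_repo(".")
```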

Licensing & Compatibility

The engines used by the application are MIT or Apache-2.0 licensed. However, the pre-trained models carry their own usage restrictions: StyleTTS 2 requires attribution or explicit permission for synthesized voices, and F5-TTS uses a base model licensed under CC-BY-NC-4.0, which prohibits commercial use.

Limitations & Caveats

The application is primarily designed for Windows. Because its GUI is not built on Gradio, it cannot be run on cloud computers and requires local hardware. The F5-TTS base model's non-commercial license rules it out for commercial audiobook production. Torch version management can be complex, and Torch may need to be reinstalled after adding a different engine.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 11 stars in the last 30 days
