audiobook_maker  by JarodMica

Audiobook creator using open-source TTS/STS models

created 1 year ago
475 stars

Top 65.1% on sourcepulse

Project Summary

This project provides a Windows GUI application for creating audiobooks with deep learning text-to-speech (TTS) and speech-to-speech (S2S) models. It targets users who want to generate high-quality audiobooks and offers multi-speaker support, in-progress saving, and bulk editing.

How It Works

The application was restructured from a monolithic design into a modular one that more closely approximates Model-View-Controller (MVC). The code is split into view.py (GUI), controller.py (logic), and model.py (functional code), which improves maintainability and makes it easier to integrate new TTS and S2S engines. Each engine is configured dynamically and only needs to define a loading procedure and a generation procedure that returns an audio path to model.py.
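The engine contract described above can be illustrated with a minimal sketch. Every name in it (`Engine`, `ENGINES`, `register_engine`, `generate_audio`, and the `dummy` engine) is hypothetical and not taken from the project's code. The only detail drawn from the description is that an engine supplies a loading step and a generation step that returns an audio path.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict


@dataclass
class Engine:
    """Hypothetical engine record: one loader and one generator."""
    load: Callable[[dict], Any]                  # settings -> loaded model object
    generate: Callable[[Any, str, Path], Path]   # (model, text, out_dir) -> audio path


# Hypothetical registry that the model layer would consult by engine name.
ENGINES: Dict[str, Engine] = {}


def register_engine(name: str, load, generate) -> None:
    """Make an engine available under `name`."""
    ENGINES[name] = Engine(load=load, generate=generate)


def generate_audio(engine_name: str, settings: dict, text: str, out_dir: Path) -> Path:
    """Load the chosen engine, synthesize `text`, and return the audio path."""
    engine = ENGINES[engine_name]
    model = engine.load(settings)
    return engine.generate(model, text, out_dir)


# A stand-in engine that writes a placeholder file instead of real audio.
def _dummy_load(settings: dict) -> dict:
    return {"voice": settings.get("voice", "default")}


def _dummy_generate(model: dict, text: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{model['voice']}.wav"
    path.write_bytes(text.encode("utf-8"))  # placeholder bytes, not a valid WAV
    return path


register_engine("dummy", _dummy_load, _dummy_generate)
```

Under this assumed design, adding a new engine means writing two functions and one `register_engine` call, without touching the GUI or controller code.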

Quick Start & Requirements

  • Installation: Installation is manual. Clone the repository, create a Python 3.11 virtual environment, install the requirements (pip install -r requirements.txt), initialize and update the submodules, then launch the controller (python src/controller.py).
  • Prerequisites: An NVIDIA GPU with at least 8GB VRAM is recommended. The CUDA 12.1 toolkit is required for RVC. Python 3.11, Git, and FFmpeg are also necessary. The individual TTS/S2S engines (TortoiseTTS, StyleTTS 2, F5-TTS, RVC) have their own installation steps, often involving submodule updates and specific pip installs.
  • Resources: Installing Torch involves a substantial download. Some models, like F5-TTS, may require additional downloads.
  • Links: CUDA Toolkit Archive, Python 3.11, Git, FFmpeg, Torch Install Check.
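The prerequisites listed above can be checked before installing with a short script. This is a sketch and not part of the project: the function names `check_prerequisites` and `torch_cuda_status` are made up for illustration, and it only looks for the Python version, Git, FFmpeg, and (if Torch is already installed) whether Torch can see CUDA.

```python
import importlib.util
import shutil
import sys

REQUIRED_PYTHON = (3, 11)          # version named in the requirements above
REQUIRED_TOOLS = ("git", "ffmpeg")  # executables expected on PATH


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    major_minor = tuple(version_info[:2])
    if major_minor != REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]} expected, "
            f"found {major_minor[0]}.{major_minor[1]}"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


def torch_cuda_status():
    """Describe the Torch/CUDA state without failing when Torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the script still runs without it
    if torch.cuda.is_available():
        return f"torch {torch.__version__} sees CUDA {torch.version.cuda}"
    return f"torch {torch.__version__} is installed, but CUDA is not available"
```

Calling `check_prerequisites()` inside the activated virtual environment and printing the result, followed by `torch_cuda_status()`, gives a quick picture of what is still missing. The script does not check the amount of VRAM or the installed CUDA toolkit version.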

Highlighted Details

  • Supports multiple TTS engines: TortoiseTTS, StyleTTS 2, F5-TTS, and XTTS (planned).
  • Integrates S2S capabilities via RVC.
  • Features include multi-speaker generation, audio playback during generation, saving progress, bulk sentence regeneration, and audiobook reloading/exporting.
  • Codebase rewritten for modularity and maintainability, allowing easier addition of new engines.
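To make the progress-saving and bulk-regeneration features above more concrete, here is one way such state could be modeled. It is only a sketch under stated assumptions: the `Sentence` fields, the JSON layout, and the helper names are all hypothetical, and the project's actual on-disk format is not described here.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Sentence:
    """Hypothetical per-sentence record for an audiobook in progress."""
    text: str
    speaker: str
    audio_path: Optional[str] = None  # None until audio has been generated
    needs_regen: bool = False         # set when the user flags the sentence


def save_progress(sentences: List[Sentence], path: Path) -> None:
    """Write the whole sentence list to a JSON file."""
    data = [asdict(s) for s in sentences]
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_progress(path: Path) -> List[Sentence]:
    """Rebuild the sentence list from a file written by save_progress."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return [Sentence(**item) for item in data]


def mark_for_regeneration(sentences: List[Sentence], indices: List[int]) -> None:
    """Flag several sentences at once (bulk regeneration)."""
    for i in indices:
        sentences[i].needs_regen = True


def pending(sentences: List[Sentence]) -> List[int]:
    """Indices that still need audio: never generated, or flagged for regeneration."""
    return [
        i for i, s in enumerate(sentences)
        if s.audio_path is None or s.needs_regen
    ]
```

With a structure like this, resuming an interrupted session amounts to calling `load_progress` and generating only the indices returned by `pending`. Keeping the speaker on each sentence is what would allow different voices within the same book.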

Maintenance & Community

The project is actively maintained by JarodMica. Updates are managed via git pull and git submodule update. The README provides instructions for handling potential conflicts if local modifications have been made.

Licensing & Compatibility

The engines used by the application are licensed under MIT or Apache-2.0. However, the pre-trained models carry their own usage restrictions: StyleTTS 2 requires attribution or explicit permission for synthesized voices, and F5-TTS uses a base model licensed under CC-BY-NC-4.0, which restricts commercial use.

Limitations & Caveats

The application is primarily designed for Windows. Because its GUI is not built on Gradio, a browser-based framework, it cannot be run on cloud computers and requires local hardware. The non-commercial license of the F5-TTS base model rules out its use in commercial audiobook production. Torch version management can be complex, and Torch may need to be reinstalled after adding different engines.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 41 stars in the last 90 days
