This project provides an open-source tool for converting books (EPUB, PDF, TXT) into audiobooks with intelligent character voice attribution. It targets users who want to create personalized, multi-voice audiobooks using NLP and LLM technologies, offering an engaging listening experience.
How It Works
The tool processes books in three stages: text cleaning and formatting, character identification using GLiNER and LLMs for gender/age scoring, and audiobook generation with Kokoro TTS. It supports both single-voice and multi-voice narration, assigning distinct voices to identified characters based on inferred attributes. This approach enables dynamic, character-aware audiobooks.
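The three stages above can be sketched roughly as follows. This is an illustrative outline only, not the project's actual API: the function names, the `Character` fields, and the voice IDs are assumptions made for the sketch.

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of the three-stage pipeline: clean text,
# identify characters, assign voices. Stages 2 and 3 are stubbed;
# in the real tool, GLiNER extracts names and an LLM scores
# gender/age before Kokoro TTS renders the audio.

@dataclass
class Character:
    name: str
    gender: str      # inferred by the LLM scoring stage
    age_group: str   # e.g. "adult", "child"

def clean_text(raw: str) -> str:
    """Stage 1: normalize whitespace and strip formatting artifacts."""
    return re.sub(r"\s+", " ", raw).strip()

def identify_characters(text: str) -> list[Character]:
    """Stage 2 (stubbed): return characters with inferred attributes."""
    return [Character(name="Alice", gender="female", age_group="adult")]

def assign_voices(characters: list[Character]) -> dict[str, str]:
    """Stage 3 (partial): map inferred attributes to a distinct voice.
    The voice IDs here are placeholders, not a confirmed Kokoro roster."""
    table = {("female", "adult"): "af_voice", ("male", "adult"): "am_voice"}
    return {c.name: table.get((c.gender, c.age_group), "af_voice")
            for c in characters}

text = clean_text("  Alice   said hello.\n")
voices = assign_voices(identify_characters(text))
```

The key design point the sketch preserves is that voice assignment is driven by inferred attributes, so every identified character gets a stable, distinct voice across the whole book.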
Quick Start & Requirements
- Installation: Docker (recommended) or direct run via `uv` and Python 3.12.
- Prerequisites:
- Docker with host networking enabled.
- An OpenAI-compatible LLM endpoint (e.g., LM Studio).
- Kokoro TTS model setup (via Kokoro-FastAPI).
- For direct run: `calibre` (optional, for M4B output) and `ffmpeg`.
- Setup: Detailed instructions for both Docker and direct installation are provided in the project documentation.
- Links: Demo Video
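Because the tool talks to any OpenAI-compatible endpoint, a character-scoring request is just a standard chat-completions payload. The sketch below shows one plausible shape; the endpoint URL, model name, and prompt wording are assumptions for illustration, not the project's actual prompts (no network call is made here).

```python
import json

# Assumed default local address for an LM Studio style server.
ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_scoring_request(character: str, excerpt: str) -> dict:
    """Build a chat-completions payload asking the LLM to score a
    character's likely gender and age group from a text excerpt."""
    return {
        "model": "local-model",  # placeholder; the server uses its loaded model
        "messages": [
            {"role": "system",
             "content": "Given a character name and an excerpt, reply with "
                        'JSON: {"gender": ..., "age_group": ...}.'},
            {"role": "user",
             "content": f"Character: {character}\nExcerpt: {excerpt}"},
        ],
        "temperature": 0.0,  # deterministic scoring
    }

payload = build_scoring_request("Alice", "Alice laughed and ran ahead.")
body = json.dumps(payload)  # this JSON is what would be POSTed to ENDPOINT
```

Pointing `ENDPOINT` at any server that speaks the OpenAI API (LM Studio, or similar) is what makes the LLM backend swappable.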
Highlighted Details
- Gradio UI for user-friendly audiobook creation.
- M4B audiobook creation with metadata and chapter timestamps.
- Multi-format input (EPUB, PDF, TXT) and output (AAC, M4A, MP3, WAV, OPUS, FLAC, PCM, M4B).
- Supports parallel batch inferencing for faster audio generation via Kokoro TTS.
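The parallel batch inferencing mentioned above can be sketched with a thread pool: many independent TTS requests run concurrently, while output order is preserved so chunks concatenate correctly. `synthesize()` is a stand-in stub (a real implementation would POST to the Kokoro-FastAPI server); the batching logic is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(segment: tuple[str, str]) -> bytes:
    """Stub for one TTS call: (voice_id, text) -> audio bytes.
    Hypothetical placeholder for a request to a Kokoro-FastAPI server."""
    voice, text = segment
    return f"{voice}:{text}".encode()  # placeholder "audio"

def synthesize_batch(segments: list[tuple[str, str]],
                     workers: int = 4) -> list[bytes]:
    """Run TTS calls concurrently. pool.map preserves input order,
    so the audio chunks can be concatenated into the final audiobook
    without reordering."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, segments))

audio = synthesize_batch([("voice_a", "Hello."), ("voice_b", "Hi there.")])
```

A thread pool (rather than multiprocessing) fits here because each call is I/O-bound: the worker mostly waits on the TTS server, which does the heavy inference.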
Maintenance & Community
- Contributions are welcome via GitHub issues and pull requests.
- Donations are accepted via PayPal.
- GitHub Repository
Licensing & Compatibility
- Licensed under GNU General Public License v3.0 (GPL-3.0).
- GPL-3.0 is a strong copyleft license, requiring derivative works to also be open-sourced under the same license. Commercial use or linking with closed-source software may require careful consideration of license obligations.
Limitations & Caveats
- GPU-accelerated inference is CUDA-based (NVIDIA-only), so Apple Silicon GPUs are not currently supported; CPU inference is the fallback on Apple hardware.
- The effectiveness of character voice attribution relies on the accuracy of the underlying NLP and LLM models.