kani-tts by nineninesix-ai

Fast, high-quality text-to-speech generation

Created 10 months ago

461 stars

Top 64.9% on SourcePulse

Project Summary

Kani TTS is a fast, modular text-to-speech (TTS) system that generates high-quality, human-like speech from text. It targets developers and researchers who need a flexible TTS solution, and offers multilingual models and optimized inference on NVIDIA GPUs and Apple Silicon.

How It Works

Kani TTS uses a modular architecture with pre-trained models in several languages and sizes. Audio is compressed and decompressed with the NVIDIA NeMo NanoCodec, which keeps inference fast. Two specialized inference pipelines are provided: vLLM for high-performance NVIDIA GPU acceleration behind an OpenAI-compatible API, and MLX for Apple Silicon, where it benefits from the unified memory architecture.
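To make the "OpenAI-compatible API" point concrete, here is a minimal sketch of a client request in the shape of OpenAI's speech endpoint (POST /v1/audio/speech with model, input, and response_format fields). The base URL, the model name "kani-tts", and the assumption that the Kani vLLM server exposes this exact path are all hypothetical; check the project's examples/ directory for the real values. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="kani-tts", voice=None):
    """Build (but do not send) a POST request in the OpenAI speech-API shape.

    base_url and model are placeholders, not values confirmed by the project.
    """
    payload = {"model": model, "input": text, "response_format": "wav"}
    if voice is not None:
        payload["voice"] = voice  # optional speaker name, if the server supports one
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("Hello from Kani TTS.")
print(req.get_method(), req.full_url)
```

Against a running server, the audio bytes could then be read with urllib.request.urlopen(req).read() and written to a .wav file.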

Quick Start & Requirements

Installation: pip install kani-tts (from PyPI).
Inference: Options include basic (GPU/CPU), vLLM (NVIDIA GPU), and MLX (Apple Silicon). See the examples/ directory to get started.
Prerequisites: Optimized inference requires specific hardware (NVIDIA GPU with CUDA for vLLM, Apple Silicon for MLX).
Links: Discord server: https://discord.gg/NzP3rjB4SB
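The choice between the three inference options follows from the hardware. The sketch below shows one way to pick a path automatically. The backend labels "mlx", "vllm", and "basic" are just names for this example, not identifiers from the kani-tts package, and checking for the nvidia-smi executable is only a rough heuristic for "an NVIDIA GPU with CUDA is available".

```python
import platform
import shutil


def pick_backend(system, machine, has_nvidia_gpu):
    """Map hardware facts to one of the three inference options."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"    # Apple Silicon
    if has_nvidia_gpu:
        return "vllm"   # NVIDIA GPU with CUDA
    return "basic"      # generic GPU/CPU path


def detect_backend():
    """Inspect the current machine; nvidia-smi on PATH is used as a GPU hint."""
    has_nvidia = shutil.which("nvidia-smi") is not None
    return pick_backend(platform.system(), platform.machine(), has_nvidia)


print(detect_backend())
```

Keeping pick_backend free of system calls makes the decision easy to test without the hardware at hand.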

Highlighted Details

Multilingual Models: Supports English, Chinese, German, Arabic, Spanish, Korean, and Japanese.
Performance: Benchmarks report a Real-Time Factor (RTF) of 0.190 on an RTX 5090. An RTF below 1 means speech is generated faster than real time.
Hardware Optimization: Dedicated inference paths for NVIDIA GPUs (vLLM) and Apple Silicon (MLX).
Dataset & Finetuning: Includes tools like Datamio for dataset preparation and a comprehensive finetuning pipeline for custom model training.
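As a worked example of the performance figure above: RTF is conventionally defined as synthesis time divided by the duration of the audio produced. The 10-second clip length below is an illustrative assumption, not a benchmark setting from the project.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


rtf = 0.190            # figure reported for an RTX 5090
audio_seconds = 10.0   # illustrative clip length

print(f"time to synthesize: {rtf * audio_seconds:.2f} s")  # 1.90 s
print(f"speed-up over real time: {1 / rtf:.2f}x")          # 5.26x
```

In other words, at this RTF roughly 5 seconds of speech are produced per second of compute.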

Maintenance & Community

Community contributions are actively encouraged via a Discord server. Development focuses on enhancing the core architecture with specialized LLMs for TTS, expanding language and speaker support, improving audio codecs, and building diverse datasets.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and modification with attribution.

Limitations & Caveats

Performance may degrade when the input text exceeds 1000 tokens. Emotional expressivity is limited unless the models are fine-tuned on specific datasets.
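One common way to stay under a length limit like this is to split long input at sentence boundaries and synthesize each piece separately. The sketch below is not part of kani-tts. It counts whitespace-separated words as a rough stand-in for model tokens, which is an assumption (the real tokenizer will count differently, so a lower budget leaves a safety margin), and its sentence splitting only recognizes ".", "!" and "?" followed by whitespace.

```python
import re


def chunk_text(text, max_tokens=1000):
    """Group whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens is kept as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))  # flush the full chunk
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("One two. Three four. Five six.", max_tokens=4))
```

Each returned chunk can then be passed to the synthesizer in turn and the resulting audio segments concatenated.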

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

1 star in the last 30 days