kani-tts  by nineninesix-ai

Fast, high-quality text-to-speech generation

Created 3 months ago
382 stars

Top 74.8% on SourcePulse

Project Summary

Kani TTS is a fast, modular text-to-speech (TTS) system that generates high-quality, human-like speech from text. It targets developers and researchers who need a flexible TTS solution, offering multilingual support and optimized inference across diverse hardware, including NVIDIA GPUs and Apple Silicon.

How It Works

Kani TTS uses a modular architecture with pre-trained models available in several languages and sizes. It relies on the NVIDIA NeMo NanoCodec to compress and decompress audio efficiently, which enables rapid inference. Two specialized inference pipelines are provided: vLLM for high-performance NVIDIA GPU acceleration behind an OpenAI-compatible API, and MLX for optimized performance on Apple Silicon, taking advantage of its unified memory.
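
To make the "OpenAI-compatible API" point concrete, here is a minimal sketch of a client request. It assumes the vLLM server mirrors OpenAI's POST /v1/audio/speech route and its JSON fields (model, input, voice, response_format). The base URL, model name, and voice name are placeholders chosen for illustration, not values confirmed by the project.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="kani-tts", voice="default",
                         response_format="wav"):
    """Build (but do not send) a speech request in the OpenAI style.

    ASSUMPTION: the route and field names follow OpenAI's speech endpoint.
    The default base_url, model, and voice are hypothetical placeholders.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as audio_file:
        audio_file.write(audio_bytes)
    return out_path
```

With a server running, synthesize("Hello from Kani TTS") would write the response body to speech.wav. Check the examples/ directory for the actual route and parameters before relying on this shape.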

Quick Start & Requirements

  • Installation: Install via PyPI: pip install kani-tts.
  • Inference: Options include basic (GPU/CPU), vLLM (NVIDIA GPU), and MLX (Apple Silicon). See the examples/ directory to get started.
  • Prerequisites: Optimized inference requires specific hardware (NVIDIA GPU with CUDA for vLLM, Apple Silicon for MLX).
  • Links: Discord server: https://discord.gg/NzP3rjB4SB
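
The hardware prerequisites above suggest a simple rule for choosing an inference path. The sketch below is one possible way to encode it. The function names and the returned labels ("mlx", "vllm", "basic") are illustrative only and are not part of the kani-tts package, and checking for nvidia-smi on the PATH is only a rough heuristic for a usable CUDA setup.

```python
import platform
import shutil


def choose_backend(system, machine, has_nvidia_gpu):
    """Map a hardware description to one of the three inference options.

    The labels are hypothetical names for this sketch, not library constants.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"    # Apple Silicon
    if has_nvidia_gpu:
        return "vllm"   # NVIDIA GPU with CUDA
    return "basic"      # generic GPU/CPU path


def detect_backend():
    """Inspect the current machine and pick a backend.

    ASSUMPTION: nvidia-smi on PATH implies an NVIDIA driver is installed.
    """
    has_nvidia_gpu = shutil.which("nvidia-smi") is not None
    return choose_backend(platform.system(), platform.machine(), has_nvidia_gpu)
```

Keeping the decision in a pure function (choose_backend) separate from the detection step makes the rule easy to test on any machine.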

Highlighted Details

  • Multilingual Models: Supports English, Chinese, German, Arabic, Spanish, Korean, and Japanese.
  • Performance: Benchmarks indicate fast inference, with a Real-Time Factor (RTF) of 0.190 on an RTX 5090. An RTF below 1 means audio is generated faster than real time.
  • Hardware Optimization: Dedicated inference paths for NVIDIA GPUs (vLLM) and Apple Silicon (MLX).
  • Dataset & Finetuning: Includes tools like Datamio for dataset preparation and a comprehensive finetuning pipeline for custom model training.
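
To put the RTF figure from the performance bullet in concrete terms, the sketch below takes RTF in its usual sense: generation time divided by the duration of the audio produced. The helper names are illustrative, and the 10-second clip in the usage note is an example input, not a benchmark from the project.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, rtf):
    """Estimated wall-clock time needed to produce audio_seconds of speech."""
    return audio_seconds * rtf
```

At the reported RTF of 0.190, generation_time(10.0, 0.190) gives about 1.9 seconds for a 10-second clip, which is roughly 5.3 times faster than real time (1 / 0.190).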

Maintenance & Community

Community contributions are actively encouraged via a Discord server. Development focuses on enhancing the core architecture with specialized LLMs for TTS, expanding language and speaker support, improving audio codecs, and building diverse datasets.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and modification with attribution.

Limitations & Caveats

Performance may degrade when the input text exceeds 1000 tokens. Emotional expressivity is limited unless models are fine-tuned on specific datasets.
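
One common workaround for a length limit like this is to split long input at sentence boundaries and synthesize each piece separately. The sketch below is an illustration rather than a project-provided utility. It counts whitespace-separated words as a crude stand-in for tokens, since the model's tokenizer is not described here; pass a real token counter through count_tokens if one is available.

```python
import re


def chunk_text(text, max_tokens=1000, count_tokens=lambda s: len(s.split())):
    """Greedily group whole sentences into chunks of at most max_tokens.

    ASSUMPTION: count_tokens defaults to a word count, which only
    approximates the model's real tokenization.
    A single sentence longer than max_tokens is kept as its own
    oversized chunk rather than being cut mid-sentence.
    """
    # Split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be synthesized on its own and the resulting audio concatenated. Choosing a budget somewhat below 1000 leaves headroom for the gap between word counts and real token counts.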

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days
