pocket-tts by kyutai-labs

Lightweight, CPU-first text-to-speech for efficient audio generation

Created 6 months ago

6,521 stars

Top 7.7% on SourcePulse

2 Experts Love This Project

Luis Capelo

Cofounder of Lightning AI

Chris Van Pelt

Cofounder of Weights & Biases

Project Summary

Pocket TTS is a lightweight, CPU-optimized text-to-speech (TTS) system that removes the need for GPUs or external web APIs. It targets developers and users who need efficient, low-latency audio generation directly on local machines or in client-side applications, without standing up dedicated inference infrastructure.

How It Works

The system uses a compact 100M-parameter model engineered for CPU execution, with no GPU dependency. Its architecture supports audio streaming, delivering the first audio chunk in about 200 ms and generating speech at roughly 6x real-time on a MacBook Air M4 CPU. It accepts arbitrarily long text inputs and supports voice cloning from audio prompts.
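
To put those figures in perspective, the sketch below works through the arithmetic for a streaming synthesizer. It assumes a constant real-time factor of 6 and a fixed 200 ms delay before the first chunk, both taken from the approximate numbers above; real throughput will vary with hardware and input text. The function names are illustrative and are not part of pocket-tts.

```python
FIRST_CHUNK_LATENCY_S = 0.2  # approximate figure quoted above, treated as a constant here


def generation_time_s(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def streaming_keeps_up(realtime_factor: float) -> bool:
    """Playback does not stall if audio is produced at least as fast as it is played."""
    return realtime_factor >= 1.0


if __name__ == "__main__":
    clip_seconds = 60.0  # one minute of speech
    total = generation_time_s(clip_seconds)
    print(f"Whole clip generated after about {total:.1f} s")
    print(f"With streaming, playback can begin after about {FIRST_CHUNK_LATENCY_S:.1f} s")
    print(f"Generation outpaces playback: {streaming_keeps_up(6.0)}")
```

Under these assumptions, a listener waits for the first chunk only, not for the whole clip, which is why streaming matters more for perceived latency than raw throughput does.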

Quick Start & Requirements

Installation: pip install pocket-tts or uv add pocket-tts.
Prerequisites: Python 3.10-3.14, PyTorch 2.5+ (CPU version).
Resources: A live demo is available on the Kyutai website. Further resources include the GitHub repository, Hugging Face model card, tech report, paper, and documentation.
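
Before installing, it can help to confirm that the environment meets the prerequisites listed above. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the version bounds are copied from the list above (Python 3.10-3.14, PyTorch 2.5+). It checks the installed `torch` distribution version and does not verify that the build is CPU-only.

```python
import re
import sys
from importlib import metadata

_VERSION_RE = re.compile(r"(\d+)\.(\d+)")


def major_minor(version: str) -> tuple[int, int]:
    """Parse the leading 'X.Y' from a version string such as '2.5.1+cpu'."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def python_ok(version: tuple[int, int]) -> bool:
    """True if the interpreter version falls in the supported 3.10-3.14 range."""
    return (3, 10) <= version <= (3, 14)


def torch_ok(version: str) -> bool:
    """True if the PyTorch version is 2.5 or newer."""
    return major_minor(version) >= (2, 5)


def check_environment() -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    current = sys.version_info[:2]
    if not python_ok(current):
        problems.append(f"Python {current[0]}.{current[1]} is outside 3.10-3.14")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not torch_ok(torch_version):
            problems.append(f"PyTorch {torch_version} is older than 2.5")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment looks compatible"]:
        print(line)
```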

Highlighted Details

CPU-native operation: No GPU required, simplifying deployment.
Small footprint: 100M parameters for efficient resource usage.
Real-time audio streaming: Low latency (~200ms) and fast generation (~6x real-time).
Voice cloning: Supports custom voices from audio samples.
Flexible API: Offers both Python library and CLI interfaces.
Browser-ready: Community implementations enable client-side execution.

Maintenance & Community

Contributions are welcomed via GitHub issues and pull requests. Development instructions are available in CONTRIBUTING.md. No official community channels (e.g., Discord, Slack) are specified. Authors include Manu Orsini, Simon Rouard, Gabriel De Marmiesse, Václav Volhejn, Neil Zeghidour, and Alexandre Défossez.

Licensing & Compatibility

The provided README does not explicitly state a license for the core Pocket TTS library. The project uses a catalog of pre-made voices and supports voice cloning; each voice carries its own license, detailed on a linked page. Review these voice licenses before any commercial use or creation of derivative works. Client-side browser deployment is currently possible only through community implementations.

Limitations & Caveats

Currently supports English only. Does not natively support adding explicit silence markers for pauses within text. Official WebAssembly/JavaScript support is pending; community implementations exist. Quantization (e.g., int8) is not yet supported. GPU acceleration was not found to provide speed benefits due to model size and batching strategy.
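
Because explicit pause markers are not supported, one possible workaround is to split the text yourself and insert silent samples between separately synthesized segments. The sketch below illustrates that idea only. The `[pause:SECONDS]` marker syntax is invented for this example, `synthesize` is a caller-supplied stand-in for whatever call produces audio samples (it is not a pocket-tts function), and the sample rate must be supplied by the caller. Synthesizing segments independently may also change prosody at the boundaries.

```python
import re
from typing import Callable, List

# Hypothetical marker syntax for this example, e.g. "Hello[pause:0.5]world".
PAUSE_RE = re.compile(r"\[pause:(\d+(?:\.\d+)?)\]")


def synthesize_with_pauses(
    text: str,
    synthesize: Callable[[str], List[float]],  # stand-in for a real TTS call
    sample_rate: int,
) -> List[float]:
    """Concatenate synthesized segments, inserting zeros where pause markers appear."""
    samples: List[float] = []
    # With one capture group, re.split alternates: text, seconds, text, seconds, ...
    for index, part in enumerate(PAUSE_RE.split(text)):
        if index % 2 == 0:
            segment = part.strip()
            if segment:
                samples.extend(synthesize(segment))
        else:
            samples.extend([0.0] * round(float(part) * sample_rate))
    return samples


if __name__ == "__main__":
    # Fake synthesizer for demonstration: one sample of value 1.0 per character.
    def fake_synthesize(segment: str) -> List[float]:
        return [1.0] * len(segment)

    audio = synthesize_with_pauses("Hi[pause:0.5]there", fake_synthesize, sample_rate=10)
    print(len(audio), "samples")
```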

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive

Star History

1,960 stars in the last 30 days