pocket-tts  by kyutai-labs

Lightweight, CPU-first text-to-speech for efficient audio generation

Created 1 month ago
3,080 stars

Top 15.6% on SourcePulse

Project Summary

Pocket TTS is a lightweight, CPU-optimized text-to-speech (TTS) library that runs without a GPU or external web APIs. It targets developers and users who need efficient, low-latency audio generation on local machines or in client-side applications, without complex infrastructure.

How It Works

The system uses a compact 100M-parameter model engineered for CPU execution, with no GPU dependency. Its architecture supports audio streaming, delivering the first audio chunk in ~200ms and generating audio at ~6x real-time on a MacBook Air M4 CPU. It handles text inputs of any length and supports voice cloning from audio prompts.
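To make the "~6x real-time" and "~200ms" figures concrete, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: the function names are hypothetical, the defaults are taken from the approximate figures quoted above, and actual throughput will vary with hardware and input.

```python
def generation_time_s(audio_duration_s: float, realtime_factor: float = 6.0) -> float:
    """Estimate wall-clock seconds needed to synthesize `audio_duration_s` of speech.

    A real-time factor of 6 means 6 seconds of audio per second of compute.
    The default is the approximate figure quoted for a MacBook Air M4 CPU.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_duration_s / realtime_factor


def streaming_keeps_up(realtime_factor: float) -> bool:
    """Streaming playback never stalls if audio is produced at least as fast as it plays."""
    return realtime_factor >= 1.0


if __name__ == "__main__":
    # One minute of speech at ~6x real-time.
    print(f"60 s of audio: ~{generation_time_s(60.0):.1f} s of compute")
    # With streaming, playback can begin after the first chunk (~0.2 s),
    # rather than after the whole clip has been generated.
    print("Streaming keeps up:", streaming_keeps_up(6.0))
```

With streaming, the listener waits roughly for the first chunk rather than for the full generation time, provided the real-time factor stays above 1.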

Quick Start & Requirements

  • Installation: pip install pocket-tts or uv add pocket-tts.
  • Prerequisites: Python 3.10-3.14, PyTorch 2.5+ (CPU version).
  • Resources: A live demo is available on the Kyutai website. Further resources include the GitHub Repository, Hugging Face Model Card, Tech report, Paper, and comprehensive Documentation.
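The prerequisites above can be checked before installing with a small script. This is a minimal sketch, not part of pocket-tts: the helper names are hypothetical, and the version bounds are copied from the prerequisites line.

```python
import re
import sys

# Bounds taken from the prerequisites listed above.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 14)
MIN_TORCH = (2, 5)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.5.1+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def python_supported(version_info=sys.version_info) -> bool:
    """True if the interpreter's (major, minor) lies within 3.10 to 3.14."""
    return MIN_PYTHON <= tuple(version_info[:2]) <= MAX_PYTHON


def torch_supported(torch_version: str) -> bool:
    """True if the given PyTorch version string is 2.5 or newer."""
    return parse_major_minor(torch_version) >= MIN_TORCH
```

If PyTorch is already installed, `torch_supported(torch.__version__)` checks the installed build; `python_supported()` with no arguments checks the running interpreter.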

Highlighted Details

  • CPU-native operation: No GPU required, simplifying deployment.
  • Small footprint: 100M parameters for efficient resource usage.
  • Real-time audio streaming: ~200ms to the first audio chunk and ~6x real-time generation.
  • Voice cloning: Supports custom voices from audio samples.
  • Flexible API: Offers both Python library and CLI interfaces.
  • Browser-ready: Community implementations enable client-side execution.

Maintenance & Community

Contributions are welcome via GitHub issues and pull requests; development instructions are in CONTRIBUTING.md. No official community channels (e.g., Discord, Slack) are specified. Authors include Manu Orsini, Simon Rouard, Gabriel De Marmiesse, Václav Volhejn, Neil Zeghidour, and Alexandre Défossez.

Licensing & Compatibility

The provided README does not explicitly state a license for the core Pocket TTS library. The project offers a catalog of pre-made voices and supports voice cloning; each voice carries its own license, detailed on a linked page. Review these voice licenses before commercial use or before creating derivative works. Client-side browser deployment is noted as possible.

Limitations & Caveats

Pocket TTS currently supports English only. It does not natively support explicit silence markers for pauses within text. Official WebAssembly/JavaScript support is pending, though community implementations exist. Quantization (e.g., int8) is not yet supported. GPU acceleration was not found to improve speed, owing to the model size and batching strategy.
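Since pauses cannot be marked inside the text, one possible workaround is to synthesize each phrase separately and join the resulting audio with silence. The sketch below shows only the joining step, under stated assumptions: each segment is a sequence of mono float samples, the sample rate is whatever the synthesis step produced, and `join_with_silence` is a hypothetical helper, not part of the pocket-tts API.

```python
from typing import Sequence


def join_with_silence(
    segments: Sequence[Sequence[float]],
    sample_rate: int,
    pause_s: float,
) -> list[float]:
    """Concatenate mono audio segments, inserting `pause_s` seconds of silence between them.

    Silence is represented as zero-valued samples. No silence is added
    before the first segment or after the last one.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if pause_s < 0:
        raise ValueError("pause_s must be non-negative")

    silence = [0.0] * round(sample_rate * pause_s)
    joined: list[float] = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(silence)
        joined.extend(segment)
    return joined
```

In practice the same idea applies to array types such as NumPy arrays or tensors; plain lists are used here only to keep the sketch dependency-free.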

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 56
  • Issues (30d): 52
  • Star history: 3,091 stars in the last 30 days
