neutts by neuphonic

On-device Text-to-Speech with instant voice cloning

Created 5 months ago

4,897 stars

Top 10.1% on SourcePulse

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

NeuTTS Air addresses the limited accessibility of state-of-the-art Text-to-Speech (TTS) models by offering what the project describes as the world's first super-realistic, on-device TTS with instant voice cloning. It targets developers building embedded voice agents, assistants, and compliance-sensitive applications, and generates natural-sounding speech in real time while keeping processing on the local device.

How It Works

The system pairs a compact 0.5B-parameter LLM backbone (Qwen 0.5B) with a proprietary neural audio codec (NeuCodec). The architecture is optimized for on-device inference on hardware such as smartphones, laptops, and Raspberry Pis, trading off speed, size, and quality to support real-time generation at low power consumption.
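To make the "compact" claim concrete, the following back-of-the-envelope sketch estimates the size of the backbone weights at a few precisions. It assumes a nominal 0.5e9 parameters and a flat number of bits per weight; real quantized files also carry per-block scales and metadata, and the codec is not counted, so actual downloads will differ. The function name is illustrative and not part of the project.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: the nominal "0.5B" figure from the summary, not an exact count.
NOMINAL_PARAMS = 0.5e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    size = approx_weight_size_gb(NOMINAL_PARAMS, bits)
    print(f"{label}: ~{size:.2f} GB of weights")
```

Under these assumptions the weights come to roughly 1 GB at 16-bit, 0.5 GB at 8-bit, and 0.25 GB at 4-bit, which is consistent with the stated focus on phones and single-board computers.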

Quick Start & Requirements

Clone the repo, install espeak (brew install espeak on macOS, sudo apt install espeak on Ubuntu/Debian), and run pip install -r requirements.txt (requires Python >= 3.11). Optional installs are llama-cpp-python for running GGUF models (with CUDA/MPS support) and onnxruntime for ONNX decoders. A basic example script demonstrates synthesis. The README links to the HuggingFace models (GGUF, Q8, Q4) and a YouTube demo.
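Before installing the Python dependencies, it can help to confirm the two prerequisites named above. The sketch below uses only the standard library; the helper name is hypothetical, and it assumes the espeak binary is exposed on PATH under the name espeak (some distributions may ship it under a different name).

```python
import shutil
import sys


def missing_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `version_info` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    problems = []
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python >= 3.11 is required")
    # Assumption: the binary is named "espeak" on this system.
    if which("espeak") is None:
        problems.append("espeak was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    print("Ready to install." if not issues else "\n".join(issues))
```

Running the script prints either a confirmation or one line per missing prerequisite.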

Highlighted Details

Best-in-class realism for its compact size.
Optimized for on-device deployment (GGML format) on mobile/embedded systems.
Instant voice cloning from as little as 3 seconds of reference audio.
Generated audio is watermarked via Perth Watermarker.
Real-time inference on mid-range hardware.

Maintenance & Community

The provided README lacks specific details on project maintainers, community channels (Discord/Slack), roadmaps, or sponsorships.

Licensing & Compatibility

The repository's README does not explicitly state the software license. This omission is a significant adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

Voice cloning works best when the reference audio is a mono .wav file sampled at 16-44 kHz, 3-15 seconds long, and contains clean, natural speech. All outputs are watermarked. The README does not detail hardware requirements for real-time performance beyond "mid-range devices".
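The measurable parts of these reference-audio guidelines can be checked before a clip is used for cloning. The following is a minimal sketch using Python's standard wave module; the function name and thresholds are illustrative, "44 kHz" is interpreted as 44.1 kHz, and whether the speech is clean and natural cannot be verified this way.

```python
import wave


def reference_audio_problems(path, min_rate_hz=16_000, max_rate_hz=44_100,
                             min_seconds=3.0, max_seconds=15.0):
    """Return a list of guideline violations for a reference clip.

    Only checks container format, channel count, sample rate, and duration.
    wave.open raises wave.Error for files it cannot parse (e.g. non-PCM WAV).
    """
    if not str(path).lower().endswith(".wav"):
        return ["file is not a .wav"]

    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()
    seconds = frames / rate if rate else 0.0

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not min_rate_hz <= rate <= max_rate_hz:
        problems.append(f"sample rate {rate} Hz is outside {min_rate_hz}-{max_rate_hz} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(f"duration {seconds:.1f} s is outside {min_seconds}-{max_seconds} s")
    return problems
```

An empty list means the clip meets the measurable guidelines; for example, `reference_audio_problems("reference.wav")` could be called on a hypothetical file before passing it to the synthesis script.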

Health Check

Last Commit

2 days ago

Responsiveness

Inactive

Star History

117 stars in the last 30 days