neutts-air by neuphonic

On-device Text-to-Speech with instant voice cloning

Created 1 week ago

3,114 stars

Top 15.3% on SourcePulse

Summary

NeuTTS Air is an on-device Text-to-Speech (TTS) model with instant voice cloning, which its authors describe as the world's first super-realistic model of its kind. It aims to make state-of-the-art TTS accessible to developers building embedded voice agents, assistants, and compliance-sensitive applications that need natural-sounding, real-time speech and speaker cloning with all processing kept on the local device.

How It Works

The system pairs a compact 0.5B-parameter LLM backbone (Qwen 0.5B) with a proprietary neural audio codec (NeuCodec). The architecture is optimized for on-device inference on hardware such as smartphones, laptops, and Raspberry Pis, balancing speed, size, and quality to deliver real-time generation at low power consumption.

Quick Start & Requirements

  • Clone the repository.
  • Install espeak: brew install espeak on macOS, or sudo apt install espeak on Ubuntu/Debian.
  • Install the Python dependencies with pip install -r requirements.txt (requires Python >= 3.11).
  • Optional: install llama-cpp-python to run GGUF models (with CUDA/MPS support) and onnxruntime to use ONNX decoders.

The repository includes a basic example script that demonstrates synthesis, along with links to the HuggingFace models (GGUF, Q8, Q4) and a YouTube demo.
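
The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of the project: the function name preflight is hypothetical, and it assumes that the espeak package places an espeak (or espeak-ng) binary on PATH and that llama-cpp-python and onnxruntime are importable as llama_cpp and onnxruntime.

```python
import importlib.util
import shutil
import sys


def preflight() -> dict[str, bool]:
    """Report which Quick Start prerequisites appear to be satisfied.

    Hypothetical helper for illustration; it is not shipped with the repo.
    """
    return {
        # The summary states that Python >= 3.11 is required.
        "python >= 3.11": sys.version_info >= (3, 11),
        # Assumption: the system package exposes `espeak` or `espeak-ng` on PATH.
        "espeak": any(shutil.which(name) for name in ("espeak", "espeak-ng")),
        # Optional: llama-cpp-python (import name `llama_cpp`) for GGUF models.
        "llama_cpp (optional)": importlib.util.find_spec("llama_cpp") is not None,
        # Optional: onnxruntime for ONNX decoders.
        "onnxruntime (optional)": importlib.util.find_spec("onnxruntime") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing optional entry only matters if the GGUF or ONNX paths are actually used.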

Highlighted Details

  • Best-in-class realism for its compact size.
  • Optimized for on-device deployment (GGML format) on mobile/embedded systems.
  • Instant voice cloning from as little as 3 seconds of reference audio.
  • Generated audio is watermarked via Perth Watermarker.
  • Real-time inference on mid-range hardware.
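
The summary does not say how "real-time" is measured. One common yardstick is the real-time factor (synthesis time divided by the duration of the audio produced), where a value below 1.0 means speech is generated faster than it plays back. The sketch below assumes that metric; real_time_factor and the stand-in synthesizer are hypothetical names, and no NeuTTS Air API is used.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[], float]) -> float:
    """Time one synthesis call and return elapsed_seconds / audio_seconds.

    `synthesize` is any zero-argument callable that runs a synthesis and
    returns the duration, in seconds, of the audio it generated.
    """
    start = time.perf_counter()
    audio_seconds = synthesize()
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("synthesize() must return a positive audio duration")
    return elapsed / audio_seconds


if __name__ == "__main__":
    def fake_synthesis() -> float:
        # Stand-in for a real TTS call: pretend to work briefly,
        # then report that 2 seconds of audio were produced.
        time.sleep(0.05)
        return 2.0

    rtf = real_time_factor(fake_synthesis)
    print(f"real-time factor: {rtf:.3f} ({'real-time' if rtf < 1 else 'slower than real-time'})")
```

To measure actual hardware, replace the stand-in with a wrapper around the project's own synthesis call that returns the length of the generated audio.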

Maintenance & Community

The provided README lacks specific details on project maintainers, community channels (Discord/Slack), roadmaps, or sponsorships.

Licensing & Compatibility

The repository's README does not explicitly state the software license. This omission is a significant adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

For best results, the reference audio should be a mono .wav file sampled at 16-44 kHz, 3-15 seconds long, containing clean, natural speech. All outputs are watermarked. The hardware needed for real-time performance is not specified beyond "mid-range devices".
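
The measurable parts of the reference-audio guidance (mono, sample rate, duration) can be checked with Python's standard wave module. This is a sketch under stated assumptions: check_reference_wav is a hypothetical helper, the upper sample-rate bound is taken as 44,100 Hz on the assumption that "44 kHz" refers to the common 44.1 kHz rate, and "clean, natural speech" cannot be verified this way.

```python
import sys
import wave


def check_reference_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file fits the guidance.

    Only uncompressed PCM .wav files can be opened by the `wave` module;
    other files raise wave.Error.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    # Assumption: "16-44 kHz" is read as 16,000 to 44,100 Hz inclusive.
    if not 16_000 <= rate <= 44_100:
        problems.append(f"sample rate {rate} Hz is outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        problems.append(f"duration {duration:.1f}s is outside 3-15s")
    return problems


if __name__ == "__main__" and len(sys.argv) == 2:
    issues = check_reference_wav(sys.argv[1])
    print("looks fine" if not issues else "\n".join(issues))
```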

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 18
  • Issues (30d): 29
  • Star History: 3,224 stars in the last 13 days
