naturalspeech2-pytorch by lucidrains

PyTorch implementation of Natural Speech 2, a zero-shot speech/singing synthesizer

Created 2 years ago

1,333 stars

Top 29.9% on SourcePulse

1 Expert Loves This Project

aangelopoulos

Anastasios Angelopoulos

Cofounder of LMArena

Project Summary

This repository provides a PyTorch implementation of NaturalSpeech 2, a zero-shot speech and singing synthesizer. It targets ML/AI engineers and researchers working on text-to-speech (TTS). Instead of predicting discrete tokens, it pairs a neural audio codec with a latent diffusion model that generates speech non-autoregressively, aiming for natural and expressive output.
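
Neural audio codecs such as EnCodec compress audio with residual vector quantization (RVQ): each quantizer stage encodes the error left by the previous stages. The continuous latent that a diffusion model can target is the sum of the selected codebook vectors, not the list of integer codes. The sketch below illustrates that relationship with toy, hand-picked 2-D codebooks. The helper names (nearest, rvq_encode) are hypothetical and do not come from the repository or from EnCodec.

```python
# Hypothetical residual vector quantizer (RVQ) sketch. Each stage quantizes
# the residual left by earlier stages. The codes are discrete, but their
# summed codebook vectors form a continuous latent vector.

def nearest(codebook, v):
    """Return the index of the codebook entry closest to v (squared L2)."""
    def dist(c):
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, v):
    """Quantize vector v with a stack of codebooks; return (codes, quantized)."""
    residual = list(v)
    quantized = [0.0] * len(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        chosen = codebook[idx]
        # Accumulate the continuous latent and shrink the remaining error.
        quantized = [q + c for q, c in zip(quantized, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return codes, quantized

# Toy codebooks: illustrative values, not taken from any trained codec.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],     # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],   # fine stage
]

codes, z = rvq_encode(codebooks, [1.2, 0.9])
print(codes, z)  # [1, 1] [1.25, 1.0]
```

Generating the continuous vector z directly avoids committing to a sequence of discrete code indices, one per stage and frame.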

How It Works

The system trains a latent diffusion model on the continuous latent vectors produced by a neural audio codec such as EnCodec. The latent frames are denoised in parallel rather than decoded token by token, so generation is non-autoregressive, which contributes to both naturalness and speed. The implementation centers on denoising diffusion and incorporates improvements to the transformer components, aiming for state-of-the-art quality.
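
The following is a minimal sketch of denoising diffusion on continuous latent frames. It assumes a standard DDPM-style forward process with the cosine noise schedule from "Improved DDPM" and deterministic DDIM sampling. NaturalSpeech 2 and this repository may use a different parameterization or sampler. The denoise callable is a hypothetical stand-in for the repository's conditioned transformer. Here it is an oracle that already knows the clean latents, so the loop can be checked end to end.

```python
import math
import random

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction alpha_bar(t) under the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(z0, eps, alpha_bar):
    """Forward (noising) process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(z0, eps)]

def ddim_step(zt, z0_hat, ab_t, ab_prev):
    """Deterministic (eta = 0) DDIM update toward a less noisy step,
    given the denoiser's estimate z0_hat of the clean latent frame."""
    # Recover the noise implied by z_t and the clean-latent estimate.
    eps_hat = [(z - math.sqrt(ab_t) * x) / math.sqrt(1.0 - ab_t)
               for z, x in zip(zt, z0_hat)]
    # Re-noise the estimate to the previous (smaller) noise level.
    return q_sample(z0_hat, eps_hat, ab_prev)

def sample(denoise, length, dim, T=50, seed=0):
    """Generate `length` latent frames of size `dim` starting from pure noise.
    `denoise(z, t)` is hypothetical: it sees the whole sequence and returns
    a clean-latent estimate for every frame."""
    rng = random.Random(seed)
    # All frames start as Gaussian noise and are refined together.
    z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
    for t in range(T, 0, -1):
        ab_t, ab_prev = cosine_alpha_bar(t, T), cosine_alpha_bar(t - 1, T)
        z0_hat = denoise(z, t)  # one call refines every frame at once
        z = [ddim_step(zf, xf, ab_t, ab_prev) for zf, xf in zip(z, z0_hat)]
    return z

# Toy check with an oracle denoiser that already knows the clean latents.
clean = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.5]]
out = sample(lambda z, t: clean, length=3, dim=2)
```

With a trained model, denoise would also receive the text and speech-prompt conditioning. The final step lands on alpha_bar(0) = 1, so the last update returns the model's clean-latent estimate, which a codec decoder would then turn back into audio.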

Quick Start & Requirements

Install: pip install naturalspeech2-pytorch
Requirements: PyTorch; a CUDA-capable GPU is implied by the .cuda() calls in the README examples.
Usage examples and a Trainer class are provided in the README.
Official Docs: None linked; the README serves as the primary documentation.

Highlighted Details

Zero-shot speech and singing synthesis capabilities.
Utilizes latent diffusion models with continuous latent vectors.
Non-autoregressive generation for natural speech.
Supports conditioning on text and speech prompts.
Includes a Trainer class for simplified training and sampling loops.
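
The latency argument behind non-autoregressive generation can be shown with simple arithmetic. This sketch assumes 75 latent frames per second (EnCodec's 24 kHz model uses a hop of 320 samples, and 24000 / 320 = 75). It also assumes a 50-step diffusion sampler and a hypothetical autoregressive baseline that emits one frame per network call. These figures are illustrative and are not measurements of this repository.

```python
def sequential_model_calls(seconds, frame_rate_hz, diffusion_steps):
    """Count sequential network calls: frame-by-frame autoregressive decoding
    versus diffusion that refines all frames in parallel at each step."""
    frames = round(seconds * frame_rate_hz)
    autoregressive = frames               # one call per generated frame
    non_autoregressive = diffusion_steps  # one call per denoising step, any length
    return frames, autoregressive, non_autoregressive

# 10 s of audio at an assumed 75 frames/s with an assumed 50-step sampler.
frames, ar, nar = sequential_model_calls(10.0, 24000 / 320, 50)
print(frames, ar, nar)  # 750 750 50
```

Each diffusion call still processes the whole sequence, so total compute grows with length. The saving is in the number of sequential steps, which drives generation latency.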

Maintenance & Community

Developed by lucidrains, with contributions acknowledged from Manmay.
Acknowledges Hugging Face for sponsorship and for the accelerate library.
The project is marked as "wip" (work in progress).

Licensing & Compatibility

The README does not explicitly state a license. Before commercial or closed-source use, check the repository's license file and the licenses of its dependencies.

Limitations & Caveats

The project is marked as "wip," indicating ongoing development and potential for breaking changes.
Some features, such as automatic slicing of audio into speech prompts and certain conditioning methods, are still under development or pending further discussion.
The usage examples imply a need for significant computational resources (GPU) for training and inference.
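
Automatic prompt slicing is listed as unfinished. One plausible approach cuts a random contiguous segment of a training utterance's latent sequence to serve as the speech prompt and keeps the remaining frames as the generation target, so the model cannot simply copy the prompt. The sketch below is a hypothetical helper that illustrates this idea. It is not the repository's code, and the fraction bounds (min_frac, max_frac) are arbitrary assumptions.

```python
import random

def slice_prompt(latents, min_frac=0.1, max_frac=0.5, rng=None):
    """Hypothetical helper: cut a random contiguous segment of a latent
    sequence to use as the speech prompt. Returns (prompt, target), where
    target is the sequence with the prompt segment removed."""
    rng = rng or random.Random()
    n = len(latents)
    if n < 2:
        raise ValueError("need at least two frames to split into prompt and target")
    # Prompt length is a random fraction of the sequence, kept in [1, n - 1].
    length = max(1, min(n - 1, int(n * rng.uniform(min_frac, max_frac))))
    start = rng.randint(0, n - length)  # inclusive bounds keep the slice in range
    prompt = latents[start:start + length]
    target = latents[:start] + latents[start + length:]
    return prompt, target

# Frame indices stand in for latent vectors here.
prompt, target = slice_prompt(list(range(100)), rng=random.Random(0))
```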

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

0

Star History

3 stars in the last 30 days

Explore Similar Projects

Ming-UniAudio by inclusionAI

Unified speech LLM for understanding, generation, and editing

Created 3 months ago

Updated 1 month ago

Meta-voicebox by SpeechifyInc

PyTorch implementation of Meta's Voicebox speech model

Created 2 years ago

Updated 2 years ago

Comprehensive-Transformer-TTS by keonlee9420

PyTorch toolkit for non-autoregressive transformer text-to-speech (TTS)

Created 4 years ago

Updated 3 years ago

DiffGAN-TTS by keonlee9420

PyTorch implementation for text-to-speech using denoising diffusion GANs

Created 3 years ago

Updated 3 years ago

ProDiff by Rongjiehuang

PyTorch implementation for fast diffusion text-to-speech

Created 3 years ago

Updated 2 years ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

FastDiff by Rongjiehuang

PyTorch implementation for fast, high-fidelity speech synthesis via conditional diffusion

Created 4 years ago

Updated 1 year ago

Starred by

Junyang Lin (Core Maintainer at Alibaba Qwen).

NATSpeech by NATSpeech

PyTorch framework for non-autoregressive text-to-speech (NAR-TTS)

Created 3 years ago

Updated 2 years ago

Starred by

Benjamin Bolte (Cofounder of K-Scale Labs).

speech-synthesis-paper by wenet-e2e

Speech synthesis papers list

Created 5 years ago

Updated 2 years ago

TransformerTTS by spring-media

TensorFlow 2 implementation for non-autoregressive text-to-speech

Created 5 years ago

Updated 1 year ago

MegaTTS3 by bytedance

PyTorch implementation for zero-shot speech synthesis

Created 9 months ago

Updated 4 months ago

Starred by

Luis Capelo (Cofounder of Lightning AI) and

Didier Lopes (Founder of OpenBB).

Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

Created 11 months ago

Updated 10 months ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral),

Benjamin Bolte (Cofounder of K-Scale Labs), and

3 more.

espnet by espnet

End-to-end speech processing toolkit for various speech tasks

Created 8 years ago

Updated 3 weeks ago