LongCat-AudioDiT by meituan-longcat

High-fidelity diffusion text-to-speech and voice cloning

Created 1 week ago

418 stars

Top 70.2% on SourcePulse

Project Summary

LongCat-AudioDiT is a state-of-the-art diffusion-based text-to-speech (TTS) model that operates directly in the waveform latent space. It targets researchers and developers seeking high-fidelity, non-autoregressive TTS and SOTA zero-shot voice cloning, simplifying the pipeline and enhancing generation quality.

How It Works

The core innovation is operating directly within the waveform latent space, bypassing intermediate acoustic representations like mel-spectrograms. This approach uses a waveform variational autoencoder (Wav-VAE) and a diffusion backbone, drastically simplifying the TTS pipeline and mitigating compounding errors. Inference is improved via a training-inference mismatch correction and adaptive projection guidance (APG), which replaces traditional classifier-free guidance for elevated generation quality.
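To make the guidance change concrete, here is a minimal plain-Python sketch contrasting standard classifier-free guidance with a projected-guidance variant in the spirit of APG. The function names and the parallel-component weight `eta` are illustrative assumptions for this sketch, not the repository's API:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cfg(cond, uncond, w):
    # Classifier-free guidance: step w times along (cond - uncond).
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

def apg(cond, uncond, w, eta=0.0):
    # Projected guidance: split (cond - uncond) into components
    # parallel and orthogonal to the conditional prediction, then
    # down-weight the parallel part via eta.
    diff = [c - u for c, u in zip(cond, uncond)]
    scale = dot(diff, cond) / dot(cond, cond)
    parallel = [scale * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (w - 1) * (o + eta * p)
            for c, o, p in zip(cond, orthogonal, parallel)]
```

With `eta=1.0` the projected form reduces exactly to classifier-free guidance; smaller `eta` suppresses the component of the update that points along the conditional prediction itself.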

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Requires a CUDA-enabled GPU for efficient operation, as indicated by .to("cuda") and .to_half() calls. Dependencies include torch and transformers.
  • Links: HuggingFace-compatible implementation is provided. Model weights are available via meituan-longcat/LongCat-AudioDiT-1B and meituan-longcat/LongCat-AudioDiT-3.5B.

Highlighted Details

  • Achieves state-of-the-art (SOTA) zero-shot voice cloning performance on the Seed benchmark, surpassing both open-source and closed-source models.
  • The LongCat-TTS-3.5B variant improves speaker similarity (SIM) scores on Seed-ZH (0.818) and Seed-Hard (0.797) compared to previous SOTA (Seed-TTS).
  • Code and model weights are publicly released to foster research.

Maintenance & Community

  • Contact: longcat-team@meituan.com
  • Community: a WeChat group is available for discussion.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration into closed-source projects. Does not grant rights to Meituan trademarks or patents.

Limitations & Caveats

  • A key finding indicates that superior Wav-VAE reconstruction fidelity does not necessarily correlate with better overall TTS performance.
  • No explicit mention of alpha status, known bugs, or unsupported platforms.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 19
  • Star History: 420 stars in the last 13 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Top 0.1% on SourcePulse · 6k stars
Text-to-speech model achieving human-level synthesis
Created 2 years ago · Updated 1 year ago