meituan-longcat/LongCat-AudioDiT: High-fidelity diffusion text-to-speech and voice cloning
Top 70.2% on SourcePulse
LongCat-AudioDiT is a state-of-the-art diffusion-based text-to-speech (TTS) model that operates directly in the waveform latent space. It targets researchers and developers seeking high-fidelity, non-autoregressive TTS and state-of-the-art zero-shot voice cloning, simplifying the pipeline and improving generation quality.
How It Works
The core innovation is operating directly within the waveform latent space, bypassing intermediate acoustic representations like mel-spectrograms. This approach uses a waveform variational autoencoder (Wav-VAE) and a diffusion backbone, drastically simplifying the TTS pipeline and mitigating compounding errors. Inference is improved via a training-inference mismatch correction and adaptive projection guidance (APG), which replaces traditional classifier-free guidance for elevated generation quality.
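The guidance step can be sketched as follows. This is a generic projection-based guidance sketch in the spirit of APG, not the repository's actual implementation; the function name `apg_guidance` and the parameters `scale` and `eta` are illustrative. The idea is to split the conditional/unconditional delta into components parallel and orthogonal to the conditional prediction and downweight the parallel part, rather than applying the uniform push of classifier-free guidance.

```python
import numpy as np

def apg_guidance(cond, uncond, scale=2.0, eta=0.0):
    """Projection-based guidance sketch (APG-style, illustrative).

    cond / uncond: conditional and unconditional model predictions.
    scale: guidance strength; eta: weight on the parallel component
    (eta=1 recovers classic classifier-free guidance).
    """
    # Guidance delta between conditional and unconditional predictions.
    delta = cond - uncond
    # Unit vector along the conditional prediction.
    unit = cond / (np.linalg.norm(cond) + 1e-8)
    # Split the delta into parallel and orthogonal components.
    parallel = np.dot(delta, unit) * unit
    orthogonal = delta - parallel
    # Downweight the parallel part to curb over-amplification while
    # keeping the orthogonal steering that guidance provides.
    return cond + (scale - 1.0) * (eta * parallel + orthogonal)
```

With `eta=1.0` the update reduces exactly to classifier-free guidance, `cond + (scale - 1) * delta`, which makes the two easy to compare in ablations.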
Quick Start & Requirements
- Install dependencies: pip install -r requirements.txt
- A CUDA GPU is assumed (the code uses .to("cuda") and .to_half() calls). Dependencies include torch and transformers.
- Model checkpoints: meituan-longcat/LongCat-AudioDiT-1B and meituan-longcat/LongCat-AudioDiT-3.5B.

Highlighted Details
Maintenance & Community
Contact: longcat-team@meituan.com

Licensing & Compatibility
Limitations & Caveats