LongCat-AudioDiT by meituan-longcat

High-fidelity diffusion text-to-speech and voice cloning

Created 1 week ago

418 stars

Top 70.2% on SourcePulse

Project Summary

LongCat-AudioDiT is a state-of-the-art diffusion-based text-to-speech (TTS) model that operates directly in the waveform latent space. It targets researchers and developers seeking high-fidelity, non-autoregressive TTS and SOTA zero-shot voice cloning, simplifying the pipeline and enhancing generation quality.

How It Works

The core innovation is operating directly within the waveform latent space, bypassing intermediate acoustic representations like mel-spectrograms. This approach uses a waveform variational autoencoder (Wav-VAE) and a diffusion backbone, drastically simplifying the TTS pipeline and mitigating compounding errors. Inference is improved via a training-inference mismatch correction and adaptive projection guidance (APG), which replaces traditional classifier-free guidance for elevated generation quality.
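To make the guidance change concrete, here is a minimal plain-Python sketch contrasting standard classifier-free guidance with a projected-guidance variant in the spirit of APG. The function names and the parallel-component weight `eta` are illustrative assumptions for this sketch, not the repository's API:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cfg(cond, uncond, w):
    # Classifier-free guidance: step w times along (cond - uncond).
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

def apg(cond, uncond, w, eta=0.0):
    # Projected guidance: split (cond - uncond) into components
    # parallel and orthogonal to the conditional prediction, then
    # down-weight the parallel part via eta.
    diff = [c - u for c, u in zip(cond, uncond)]
    scale = dot(diff, cond) / dot(cond, cond)
    parallel = [scale * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (w - 1) * (o + eta * p)
            for c, o, p in zip(cond, orthogonal, parallel)]
```

With `eta=1.0` the projected form reduces exactly to classifier-free guidance; smaller `eta` suppresses the component of the update that points along the conditional prediction itself.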

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Requires a CUDA-enabled GPU for efficient operation, as indicated by .to("cuda") and .to_half() calls. Dependencies include torch and transformers.
  • Links: HuggingFace-compatible implementation is provided. Model weights are available via meituan-longcat/LongCat-AudioDiT-1B and meituan-longcat/LongCat-AudioDiT-3.5B.

Highlighted Details

  • Achieves state-of-the-art (SOTA) zero-shot voice cloning performance on the Seed benchmark, surpassing both open-source and closed-source models.
  • The LongCat-TTS-3.5B variant improves speaker similarity (SIM) scores on Seed-ZH (0.818) and Seed-Hard (0.797) compared to previous SOTA (Seed-TTS).
  • Code and model weights are publicly released to foster research.

Maintenance & Community

  • Contact: longcat-team@meituan.com
  • Community: a WeChat group is available for discussion.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration into closed-source projects. Does not grant rights to Meituan trademarks or patents.

Limitations & Caveats

  • A key finding indicates that superior Wav-VAE reconstruction fidelity does not necessarily correlate with better overall TTS performance.
  • No explicit mention of alpha status, known bugs, or unsupported platforms.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 19
  • Star History: 420 stars in the last 13 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Top 0.1% on SourcePulse · 6k stars
Text-to-speech model achieving human-level synthesis
Created 2 years ago · Updated 1 year ago