AudioLCM by Text-to-Audio

Efficient text-to-audio generation

Created 1 year ago

1,155 stars

Top 33.4% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of AudioLCM, a text-to-audio generation model that leverages latent consistency models for efficient and high-quality audio synthesis. It is suitable for researchers and developers working on advanced audio generation techniques.

How It Works

AudioLCM applies a latent consistency model to text-to-audio generation, which lets it produce high-fidelity audio in very few inference steps. This makes it more efficient than traditional diffusion models, which rely on many iterative denoising steps, while aiming to maintain audio quality. The architecture likely involves a latent-space representation of audio, in which a consistency model is trained to generate coherent audio from text prompts.
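As a rough illustration of why a consistency model can get away with so few steps, the following is a minimal sketch of generic multistep consistency sampling on a toy latent vector. It is not AudioLCM's sampler. The skip/output parameterization and the constants follow the general consistency-model formulation, and `toy_network`, `consistency_fn`, `multistep_sample`, and the fixed `target` are hypothetical names invented for this example.

```python
import math
import random

# Assumed constants from the general consistency-model formulation,
# not taken from the AudioLCM repository.
SIGMA_DATA = 0.5
T_MIN = 0.002   # smallest noise level
T_MAX = 80.0    # largest noise level


def c_skip(t):
    # Weight on the noisy input; equals 1 at t == T_MIN.
    return SIGMA_DATA ** 2 / ((t - T_MIN) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    # Weight on the network output; equals 0 at t == T_MIN.
    return SIGMA_DATA * (t - T_MIN) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t, target):
    # Hypothetical oracle standing in for a trained network. For a data
    # distribution that is the single point `target`, it returns the output
    # that makes the consistency function land exactly on that point.
    if t <= T_MIN:
        return [0.0 for _ in x]
    return [(g - c_skip(t) * xi) / c_out(t) for xi, g in zip(x, target)]


def consistency_fn(x, t, target):
    # Maps a noisy latent at noise level t to an estimate of the clean latent.
    out = toy_network(x, t, target)
    return [c_skip(t) * xi + c_out(t) * oi for xi, oi in zip(x, out)]


def multistep_sample(target, timesteps, rng):
    # Step 1: start from pure noise and jump straight to a clean estimate.
    x = [rng.gauss(0.0, 1.0) * T_MAX for _ in target]
    x = consistency_fn(x, T_MAX, target)
    # Optional refinement: re-noise to a smaller level, then denoise again.
    for tau in timesteps:
        scale = math.sqrt(tau ** 2 - T_MIN ** 2)
        x_tau = [xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        x = consistency_fn(x_tau, tau, target)
    return x
```

Calling `multistep_sample([0.1, -0.2, 0.3], [10.0, 1.0], random.Random(0))` runs three network evaluations in total. A real model would replace the oracle with a trained, text-conditioned network operating on audio latents.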

Quick Start & Requirements

To generate audio, download the pretrained weights (audiolcm.ckpt, the vocoder, CLAP, and BERT) and place them in the directories the README specifies. Inference is run with the provided Python scripts (AudioLCMInfer and AudioLCMBatchInfer). Local setup and training require cloning the repository and an NVIDIA GPU with CUDA and cuDNN. Dependencies are listed in requirements.txt.
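Before running inference, it can help to confirm that all four sets of weights are where you expect them. The sketch below is a small pre-flight check, not part of the repository: `find_missing_weights` is a hypothetical helper, and every path in `EXPECTED_WEIGHTS` is a placeholder to be replaced with the locations the README actually specifies.

```python
from pathlib import Path


def find_missing_weights(expected):
    """Return the labels whose file or directory does not exist."""
    return [label for label, path in expected.items()
            if not Path(path).exists()]


# Placeholder locations (assumptions): substitute the directories
# given in the README.
EXPECTED_WEIGHTS = {
    "AudioLCM checkpoint": "path/to/audiolcm.ckpt",
    "vocoder": "path/to/vocoder",
    "CLAP": "path/to/clap",
    "BERT": "path/to/bert",
}

if __name__ == "__main__":
    missing = find_missing_weights(EXPECTED_WEIGHTS)
    if missing:
        print("Missing weights:", ", ".join(missing))
    else:
        print("All expected weights found.")
```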

Highlighted Details

Accepted by ACM-MM'24.
Supports batch generation for multiple text prompts.
Includes scripts for dataset preparation, including mel-spectrogram generation.
Provides configuration files for training VAE and latent diffusion models.
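For readers unfamiliar with the mel-spectrogram step mentioned above, here is a minimal, dependency-free sketch of how a log-mel spectrogram is typically computed: frame the signal, window it, take the power spectrum, and pool it through triangular mel filters. It does not reproduce the repository's preprocessing scripts. The function names, FFT size, hop length, and number of mel bands are all assumptions chosen for readability, and a practical pipeline would use an FFT library rather than the naive DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def power_spectrum(frame):
    # Naive DFT over the non-negative frequencies; O(n^2), for clarity only.
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def log_mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    # Returns a list of frames, each a list of n_mels log energies.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft)
              for t in range(n_fft)]  # Hann window
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        power = power_spectrum(frame)
        mel = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([math.log(max(e, 1e-10)) for e in mel])
    return frames
```

For example, a 1,024-sample signal with the default settings yields 7 frames of 20 mel bands each.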

Maintenance & Community

The README notes recent releases of related projects such as ThinkSound and OmniAudio, which suggests active development in the broader research area. It provides no community links (Discord, Slack) or roadmap details.

Licensing & Compatibility

The repository is open-source, but it does not state a specific license (e.g., MIT, Apache). A disclaimer prohibits using the technology to generate speech without consent, which may affect commercial use or redistribution.

Limitations & Caveats

The dataset download links are not provided due to copyright issues. The disclaimer regarding speech generation without consent should be carefully considered for any application. The README does not detail specific performance benchmarks or comparisons against other text-to-audio models.

Explore Similar Projects

mustango by AMAAI-Lab

FastDiff by Rongjiehuang

VITA-Audio by VITA-MLLM

Chatterbox-TTS-Extended by petermg

Make-An-Audio by Text-to-Audio

tango by declare-lab

PDF2Audio by lamm-mit

AudioLDM by haoheliu

audiolm-pytorch by lucidrains

AudioGPT by AIGC-Audio

VibeVoice by microsoft

audiocraft by facebookresearch