AudioLDM2 by haoheliu

CLI tool for text-conditional audio/music generation

Created 2 years ago

2,556 stars

Top 18.1% on SourcePulse

1 Expert Loves This Project

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

AudioLDM 2 is a diffusion model that generates audio, including music, sound effects, and speech, from text prompts. It also supports audio super-resolution and inpainting. It suits researchers and developers working on generative audio models, and anyone who needs to synthesize audio programmatically.

How It Works

AudioLDM 2 generates audio with a latent diffusion model. Self-supervised pre-training lets it learn holistic audio representations, and the architecture targets high-fidelity output. Separate checkpoints are available for music, sound effects, and text-to-speech.

Quick Start & Requirements

Install: pip3 install git+https://github.com/haoheliu/AudioLDM2.git
Prerequisites: Python 3.8+, espeak (for TTS), CUDA or MPS for GPU acceleration.
Resources: Requires approximately 20GB of RAM.
Docs: https://github.com/haoheliu/AudioLDM2
Diffusers Integration: https://huggingface.co/docs/diffusers/main/en/api/pipelines/audio_ldm2
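
The prerequisites listed above can be checked before installing. The following is a minimal sketch, not part of AudioLDM2: the function name check_prerequisites is hypothetical, and it assumes the TTS dependency is on PATH as either espeak or espeak-ng. It does not check RAM or GPU availability.

```python
import shutil
import sys


def check_prerequisites():
    """Report whether the interpreter and the optional TTS dependency look usable."""
    return {
        # The listing names Python 3.8+ as a prerequisite.
        "python_ok": sys.version_info >= (3, 8),
        # espeak is only needed for the text-to-speech checkpoints.
        # Accepting "espeak-ng" as an alternative binary name is an assumption.
        "espeak_found": any(shutil.which(name) for name in ("espeak", "espeak-ng")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {ok}")
```

A missing espeak only matters if you plan to use the speech checkpoints.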

Highlighted Details

Supports text-to-audio, text-to-music, and text-to-speech generation.
Offers high-fidelity audio generation with a 48kHz model checkpoint.
Integrates with the Hugging Face Diffusers library, which offers up to 3x faster inference and generation of arbitrary-length audio.
Includes a command-line interface for direct usage and a Gradio web app.
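
As an illustration of the Diffusers integration mentioned above, here is a minimal sketch of text-to-audio generation with AudioLDM2Pipeline. The checkpoint name cvssp/audioldm2 and the 16 kHz output rate follow the Diffusers documentation example and should be treated as assumptions for other checkpoints. The helper names pick_device and generate are hypothetical. Running generate downloads the model weights and requires torch, diffusers, and scipy.

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def generate(prompt, out_path="output.wav", seconds=10.0, steps=200, seed=0):
    """Generate one clip from a text prompt and write it to a WAV file."""
    # Heavy dependencies are imported lazily so pick_device stays importable without them.
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDM2Pipeline

    device = pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
    # Half precision is only used on CUDA here; this is a conservative assumption.
    dtype = torch.float16 if device == "cuda" else torch.float32

    # "cvssp/audioldm2" is the checkpoint used in the Diffusers docs example.
    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=dtype).to(device)

    # A CPU generator keeps the seed handling portable across devices.
    generator = torch.Generator("cpu").manual_seed(seed)
    audio = pipe(
        prompt,
        num_inference_steps=steps,
        audio_length_in_s=seconds,
        generator=generator,
    ).audios[0]

    # Assumed sample rate for this checkpoint: 16 kHz.
    scipy.io.wavfile.write(out_path, rate=16000, data=audio)
    return out_path
```

Calling generate("Rain falling on a tin roof") would write output.wav in the current directory. Lower steps trades quality for speed.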

Maintenance & Community

The project is maintained by haoheliu and contributors, although the most recent commit was about a year ago. The README does not list community channels.

Licensing & Compatibility

The project does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes that output quality can vary across hardware and suggests changing the random seed if results are poor. Some features, such as style transfer and inpainting for the 48kHz model, are listed as pending, with community contributions welcomed.
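
One way to act on the random-seed advice is to generate the same prompt under several seeds and keep the best result. The sketch below only builds the command lines and does not run them. It assumes the audioldm2 CLI accepts -t for the prompt and --seed for the seed; confirm both with audioldm2 -h before relying on them. The function name seed_sweep_commands is hypothetical.

```python
import shlex


def seed_sweep_commands(prompt, seeds):
    """Build one audioldm2 invocation per seed (commands are returned, not executed)."""
    # The -t and --seed flags are assumptions; check them against `audioldm2 -h`.
    return [["audioldm2", "-t", prompt, "--seed", str(seed)] for seed in seeds]


if __name__ == "__main__":
    for argv in seed_sweep_commands("A cat meowing in a quiet room", [0, 42, 1234]):
        # shlex.join quotes the prompt so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Each printed line can be run by hand, or the argument lists can be passed to subprocess.run once the flags are confirmed.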

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

27 stars in the last 30 days