Make-An-Audio by Text-to-Audio

Text-to-audio generation with diffusion models

Created 2 years ago
654 stars

Top 51.1% on SourcePulse

View on GitHub
Project Summary

Make-An-Audio provides a PyTorch implementation of a text-to-audio generative model built on conditional diffusion probabilistic models. It generates high-fidelity audio from text prompts and targets researchers and developers working on audio generation. The project ships pre-trained checkpoints alongside a clear reference implementation, enabling efficient, high-quality audio synthesis.

How It Works

The model uses a prompt-enhanced, conditional diffusion probabilistic approach: generation is framed as iterative denoising guided by text prompts. Judging from the required checkpoints, the pipeline operates in a latent space: a VAE compresses mel-spectrograms into compact latents, a denoising network trained in that space reverses the noising process under the guidance of CLAP text embeddings, and a BigVGAN vocoder converts the decoded spectrogram into the final waveform.
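A minimal PyTorch sketch of that inference loop may make the moving parts concrete. Everything here is a hypothetical stand-in: TextEncoder, LatentDenoiser, and the simplified update rule are placeholders for the repository's actual CLAP encoder, U-Net, and DDIM sampler, chosen only to show how classifier-free guidance steers a latent diffusion process.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Toy stand-in for a CLAP-style text encoder."""
    def __init__(self, vocab=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, token_ids):             # (B, T) int64 token ids
        return self.embed(token_ids).mean(1)  # (B, dim) pooled prompt embedding

class LatentDenoiser(nn.Module):
    """Toy stand-in for the U-Net that predicts noise in the VAE latent space."""
    def __init__(self, latent_dim=8, cond_dim=512):
        super().__init__()
        self.net = nn.Linear(latent_dim + cond_dim + 1, latent_dim)

    def forward(self, z, t, cond):
        t_feat = t.expand(z.shape[0], 1)       # broadcast timestep to the batch
        return self.net(torch.cat([z, cond, t_feat], dim=-1))

@torch.no_grad()
def sample(denoiser, text_emb, null_emb, steps=100, scale=3.0, latent_dim=8):
    """Schematic denoising loop with classifier-free guidance.

    `steps` and `scale` mirror the --ddim_steps and --scale flags of
    gen_wav.py; the update rule here is simplified, not a real DDIM schedule.
    """
    z = torch.randn(text_emb.shape[0], latent_dim)   # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps_cond = denoiser(z, t, text_emb)          # prompt-conditioned noise
        eps_uncond = denoiser(z, t, null_emb)        # unconditional noise
        eps = eps_uncond + scale * (eps_cond - eps_uncond)  # guidance push
        z = z - eps / steps                          # one denoising step
    return z  # real pipeline: VAE-decode to a spectrogram, then vocode

encoder, denoiser = TextEncoder(), LatentDenoiser()
tokens = torch.randint(0, 1000, (1, 6))              # fake tokenized prompt
null = torch.zeros(1, 1, dtype=torch.long)           # "empty prompt" tokens
latent = sample(denoiser, encoder(tokens), encoder(null))
print(latent.shape)  # torch.Size([1, 8])
```

The guidance line is the key design choice: scale=1 reduces to plain conditional sampling, while larger values (the example command uses --scale 3) trade diversity for prompt fidelity.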

Quick Start & Requirements

  • Install/Run: Clone the repository and use provided Python scripts for inference and training.
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN, Python. Specific checkpoints (maa1_full.ckpt, BigVGAN vocoder, CLAP weights) need to be downloaded and placed in ./useful_ckpts.
  • Setup: Requires downloading several large checkpoint files (a pre-flight check is sketched after this list). Inference command example: python gen_wav.py --prompt "a bird chirps" --ddim_steps 100 --duration 10 --scale 3 --n_samples 1 --save_name "results".
  • Links: Hugging Face Spaces demos are available.
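Because inference depends on those checkpoints being in place, a small pre-flight check can catch a broken setup before a long run. The sketch below assumes only the ./useful_ckpts layout described above; maa1_full.ckpt is the one filename the README confirms, while the vocoder and CLAP entries are placeholder names to adjust to the actual downloads.

```python
from pathlib import Path

# Expected checkpoint layout per the README. Only maa1_full.ckpt is a
# confirmed filename -- the vocoder and CLAP entries are placeholders.
CKPT_DIR = Path("./useful_ckpts")
REQUIRED = [
    "maa1_full.ckpt",  # main diffusion model checkpoint
    "bigvgan",         # BigVGAN vocoder weights (placeholder name)
    "CLAP",            # CLAP text/audio encoder weights (placeholder name)
]

missing = [name for name in REQUIRED if not (CKPT_DIR / name).exists()]
if missing:
    raise SystemExit(f"Missing checkpoints in {CKPT_DIR}: {missing}")
print("All checkpoints found; ready to run gen_wav.py")
```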

Highlighted Details

  • Implements Make-An-Audio (ICML'23), a text-to-audio diffusion model.
  • Supports audio inpainting via a separate HuggingFace Space.
  • Provides scripts for dataset preprocessing, VAE training, diffusion model training, and audio quality evaluation (FD, FAD, IS, KL, CLAP score; the CLAP score is sketched after this list).
  • Includes an Audio2Audio script for audio style transfer or modification.
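Among the listed metrics, the CLAP score is the easiest to illustrate: it is conventionally the cosine similarity between CLAP embeddings of the text prompt and of the generated audio. The sketch below uses random stand-in embeddings rather than a real CLAP model, so it shows the metric's shape, not the repository's evaluation code.

```python
import torch
import torch.nn.functional as F

def clap_score(text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired CLAP text and audio embeddings.

    Both inputs are (batch, dim); higher scores mean the generated audio
    matches its prompt more closely in the shared embedding space.
    """
    return F.cosine_similarity(text_emb, audio_emb, dim=-1)

# Toy usage with random stand-in embeddings (a real evaluation would use
# a pretrained CLAP model to embed prompts and generated waveforms).
text_emb = F.normalize(torch.randn(4, 512), dim=-1)
audio_emb = F.normalize(torch.randn(4, 512), dim=-1)
print(clap_score(text_emb, audio_emb).mean())  # mean score over the batch
```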

Maintenance & Community

The project is associated with ICML'23 and has an arXiv preprint. It references code from CLAP and Stable Diffusion repositories. No specific community links (Discord, Slack) or active maintenance signals are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the disclaimer warns against using the technology to generate speech without consent, signaling legal and ethical considerations. Suitability for commercial use or closed-source linking is not specified.

Limitations & Caveats

Dataset download links are not provided due to copyright issues, requiring users to source their own audio data. The disclaimer strongly advises against unauthorized speech generation, highlighting ethical and legal risks. The project's reliance on specific checkpoint files and a potentially complex training pipeline may present adoption challenges.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Top 0.1% · 3k stars
Audio generation research paper using latent diffusion
Created 2 years ago · Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 37 more.

diffusers by huggingface

Top 0.3% · 31k stars
PyTorch/Flax library for diffusion model research and applications
Created 3 years ago · Updated 14 hours ago