Amphion by open-mmlab

Toolkit for audio, music, and speech generation research

Created 1 year ago
9,390 stars

Top 5.4% on SourcePulse

Project Summary

Amphion is an open-source toolkit for reproducible research and development in audio, music, and speech generation. It is aimed at junior researchers and engineers and bundles support for multiple generation tasks, vocoders, and evaluation metrics, with the broader goal of simplifying the conversion of any input into audio.

How It Works

Amphion provides implementations of numerous state-of-the-art models across diverse audio generation tasks, including text to speech (TTS), voice conversion (VC), accent conversion (AC), singing voice conversion (SVC), and text to audio (TTA). It draws on diffusion, transformer, VAE, and flow-based architectures. Two features set it apart: visualization tools for classic models, which help users understand their internal mechanisms, and support for large-scale dataset preprocessing, exemplified by the Emilia dataset and its Emilia-Pipe pipeline.

Quick Start & Requirements

  • Installation: Via git clone and sh env.sh for Python dependencies, or via Docker (docker pull realamphion/amphion and docker run).
  • Prerequisites: Python 3.9.15 for the local install; Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA for the Docker route.
  • Resources: Requires significant disk space for datasets and GPU resources for training/inference.
  • Docs: https://github.com/open-mmlab/Amphion
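
Before choosing between the two install routes, it can help to check which prerequisites are already present. The following is a minimal sketch, not part of Amphion: the function name check_prerequisites, the REQUIRED_TOOLS mapping, and the choice of nvidia-smi as a stand-in for "NVIDIA Driver is installed" are all assumptions made for illustration. It compares only the Python major.minor version against 3.9 and does not verify CUDA or the NVIDIA Container Toolkit.

```python
import shutil
import sys

# Version taken from the prerequisites list; only major.minor is compared.
REQUIRED_PYTHON = (3, 9)

# Hypothetical mapping of prerequisite name -> executable expected on PATH.
# These executable names are assumptions about a typical install,
# not something defined by Amphion itself.
REQUIRED_TOOLS = {
    "git": "git",
    "docker": "docker",
    "nvidia_driver": "nvidia-smi",
}


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    if version_info is None:
        version_info = sys.version_info
    # Slice to (major, minor) so that 3.9.15 and other 3.9.x releases match.
    report = {"python": tuple(version_info[:2]) == REQUIRED_PYTHON}
    for name, executable in REQUIRED_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH.
        report[name] = which(executable) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If "docker" or "nvidia_driver" is reported missing, the git clone and sh env.sh route is likely the simpler starting point; if only "python" fails, the Docker image avoids having to match the local interpreter version.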

Highlighted Details

  • Supports 10+ TTS architectures (including FastSpeech2, VITS, VALL-E, NaturalSpeech2, MaskGCT, and Vevo-TTS) and 5+ VC models (including Vevo, FACodec, and Noro).
  • Includes a wide array of vocoders (GAN-based, Flow-based, Diffusion-based, Auto-regressive) and comprehensive objective evaluation metrics.
  • Features the Emilia dataset (101k+ hours) and Emilia-Pipe for in-the-wild speech data preprocessing.
  • Offers visualization tools like SingVisio for understanding model mechanisms.

Maintenance & Community

  • Active development with recent releases (Vevo1.5, Metis, MaskGCT, Vevo reproduction).
  • Community engagement via Discord channel.
  • Contributions are welcomed via CONTRIBUTING.md.

Licensing & Compatibility

  • MIT License, permitting free research and commercial use.

Limitations & Caveats

  • Some tasks, such as Singing Voice Synthesis (SVS) and Text to Music (TTM), are marked as "developing."
  • The Docker setup requires a specific NVIDIA software stack (NVIDIA Driver, NVIDIA Container Toolkit, and CUDA).

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 115 stars in the last 30 days
