fish-diffusion  by fishaudio

TTS/SVS/SVC framework for voice generation tasks

created 2 years ago
716 stars

Project Summary

Fish Diffusion is a framework for training Text-to-Speech (TTS), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC) models. It aims to provide an easy-to-understand and modular codebase for researchers and developers working on voice generation tasks. The project leverages diffusion models for voice generation, offering multi-speaker support and compatibility with 44.1kHz vocoders.

How It Works

The framework generates audio with diffusion models and offers a simpler, more decoupled code structure than the original diffsvc repository. It supports multi-speaker training, 44.1kHz vocoders, multi-machine and multi-device training, and half-precision training, which reduces training time and memory use.

Quick Start & Requirements

  • Installation: Requires Python 3.10 and PyTorch >= 2.0.0. Dependencies are managed via PDM (pdm sync).
  • Vocoder: Requires the FishAudio NSF-HiFiGAN vocoder, which can be downloaded automatically (python tools/download_nsf_hifigan.py).
  • Dataset: Audio data should be placed in a dataset directory with train and valid subfolders. Feature extraction is performed using python tools/preprocessing/extract_features.py.
  • Training: For single-machine, multi-card training, run python tools/diffusion/train.py --config configs/svc_hubert_soft.py. Multi-node training requires additional PyTorch Lightning setup.
  • Inference: Run python tools/diffusion/inference.py --config [config] --checkpoint [checkpoint file] --input [input audio] --output [output audio]. A Gradio interface is also available.
  • Links: Wiki, NSF-HiFiGAN, ContentVec
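
The steps above can be collected into a small Python sketch. It is only an illustration, not part of Fish Diffusion: the helper names (missing_dataset_subfolders, quick_start_commands) and the placeholder file names (model.ckpt, in.wav, out.wav) are assumptions made for this example. It checks that a dataset directory has the train and valid subfolders, then prints the commands in the order they are listed here. Feature extraction is shown without arguments because this summary lists none; the project Wiki is the place to check for the options it accepts.

```python
import shlex
from pathlib import Path


def missing_dataset_subfolders(root):
    """Return the required subfolders (train, valid) that are absent under root."""
    root = Path(root)
    return [name for name in ("train", "valid") if not (root / name).is_dir()]


def quick_start_commands(config, checkpoint, input_audio, output_audio):
    """Return the quick-start commands from this summary, in order, as argv lists."""
    return [
        # Install dependencies with PDM.
        ["pdm", "sync"],
        # Download the FishAudio NSF-HiFiGAN vocoder.
        ["python", "tools/download_nsf_hifigan.py"],
        # Extract features. No arguments are listed in this summary,
        # so none are shown here.
        ["python", "tools/preprocessing/extract_features.py"],
        # Train on a single machine.
        ["python", "tools/diffusion/train.py", "--config", config],
        # Run inference with a trained checkpoint.
        [
            "python", "tools/diffusion/inference.py",
            "--config", config,
            "--checkpoint", checkpoint,
            "--input", input_audio,
            "--output", output_audio,
        ],
    ]


if __name__ == "__main__":
    # "dataset" is the directory name used in this summary.
    missing = missing_dataset_subfolders("dataset")
    if missing:
        print("Missing dataset subfolders:", ", ".join(missing))

    # model.ckpt, in.wav, and out.wav are hypothetical placeholder names.
    commands = quick_start_commands(
        "configs/svc_hubert_soft.py", "model.ckpt", "in.wav", "out.wav"
    )
    for argv in commands:
        print(shlex.join(argv))
```

The script only prints the commands rather than running them, so it is safe to try before the environment is set up.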

Highlighted Details

  • Supports multi-speaker training.
  • Decoupled and simpler code structure for easier understanding.
  • Supports the 44.1kHz DiffSinger community vocoder.
  • Enables multi-machine, multi-device, and half-precision training.

Maintenance & Community

The project is under active development. Contributions are welcome via issues or pull requests. Linting is required before submission (pdm run lint). Real-time documentation can be generated (pdm run docs).

Licensing & Compatibility

The NSF-HiFiGAN vocoder is licensed under CC BY-NC-SA 4.0. The primary license for Fish Diffusion is not explicitly stated in the README, but attribution is required for derivative works and results.

Limitations & Caveats

The project is under active development, and users are advised to back up their configuration files. The NSF-HiFiGAN vocoder's CC BY-NC-SA 4.0 license may restrict commercial use or redistribution of derived models.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 18 stars in the last 90 days
