fish-diffusion  by fishaudio

TTS/SVS/SVC framework for voice generation tasks

Created 2 years ago
719 stars

Top 47.9% on SourcePulse

Project Summary

Fish Diffusion is a framework for training Text-to-Speech (TTS), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC) models. It aims to provide an easy-to-understand and modular codebase for researchers and developers working on voice generation tasks. The project leverages diffusion models for voice generation, offering multi-speaker support and compatibility with 44.1kHz vocoders.

How It Works

The framework generates audio with diffusion models and has a simpler, more decoupled code structure than its predecessor, diffsvc. It supports multi-speaker training, 44.1kHz vocoders, multi-machine and multi-device training, and half-precision training for better speed and memory efficiency.
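
To make the memory side of half-precision training concrete, the sketch below compares the raw storage of the same number of values in 32-bit and 16-bit floating point. It is only an illustration and not code from Fish Diffusion: the function name tensor_megabytes and the element count are hypothetical. Mixed-precision training setups often keep some values in full precision, so the overall saving in practice is usually smaller than the 2x shown here.

```python
# Bytes per element for the two IEEE 754 floating-point widths.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2}


def tensor_megabytes(num_elements, dtype):
    """Raw storage in MiB for num_elements values of the given dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical element count, chosen only for illustration.
    num_elements = 10_000_000
    for dtype in ("float32", "float16"):
        size = tensor_megabytes(num_elements, dtype)
        print(f"{dtype}: {size:.1f} MiB")
```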

Quick Start & Requirements

  • Installation: Requires Python 3.10 and PyTorch >= 2.0.0. Dependencies are managed via PDM (pdm sync).
  • Vocoder: Requires the FishAudio NSF-HiFiGAN vocoder, which can be downloaded automatically (python tools/download_nsf_hifigan.py).
  • Dataset: Audio data should be placed in a dataset directory with train and valid subfolders. Feature extraction is performed using python tools/preprocessing/extract_features.py.
  • Training: For single-machine, multi-card training, run python tools/diffusion/train.py --config configs/svc_hubert_soft.py. Multi-node training requires additional PyTorch Lightning setup.
  • Inference: Run python tools/diffusion/inference.py --config [config] --checkpoint [checkpoint file] --input [input audio] --output [output audio]. A Gradio interface is also available.
  • Links: Wiki, NSF-HiFiGAN, ContentVec
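
The steps above can be tied together with a small helper script. The following is a minimal sketch, not part of Fish Diffusion: the function names (missing_splits, train_command, inference_command) are hypothetical, and the checkpoint and audio file names in the demo are placeholders. Only the script paths, flags, config name, and the train and valid folder names come from the steps listed above. The sketch checks the dataset layout and assembles the training and inference commands as argument lists without running them.

```python
import shlex
import tempfile
from pathlib import Path

# Subfolder names taken from the Dataset step above.
REQUIRED_SPLITS = ("train", "valid")


def missing_splits(dataset_root):
    """Return the required subfolders that are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in REQUIRED_SPLITS if not (root / name).is_dir()]


def train_command(config):
    """Build the single-machine training command as an argument list."""
    return ["python", "tools/diffusion/train.py", "--config", config]


def inference_command(config, checkpoint, input_audio, output_audio):
    """Build the inference command as an argument list."""
    return [
        "python", "tools/diffusion/inference.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--input", input_audio,
        "--output", output_audio,
    ]


if __name__ == "__main__":
    # Demo on a throwaway directory that only has a train subfolder.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "train").mkdir()
        print("missing subfolders:", missing_splits(tmp))

    config = "configs/svc_hubert_soft.py"
    print(shlex.join(train_command(config)))
    # The checkpoint and audio paths below are placeholders, not real files.
    print(shlex.join(inference_command(
        config, "checkpoints/model.ckpt", "input.wav", "output.wav")))
```

Building the commands as lists keeps them safe to pass to subprocess.run without shell quoting problems; shlex.join is used only to print them in a copy-pasteable form.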

Highlighted Details

  • Supports multi-speaker training.
  • Decoupled and simpler code structure for easier understanding.
  • Supports the 44.1kHz DiffSinger community vocoder.
  • Enables multi-machine, multi-device, and half-precision training.

Maintenance & Community

The project is under active development. Contributions are welcome via issues or pull requests. Run the linter before submitting (pdm run lint). A live preview of the documentation can be generated with pdm run docs.

Licensing & Compatibility

The NSF-HiFiGAN vocoder is licensed under CC BY-NC-SA 4.0. The primary license for Fish Diffusion is not explicitly stated in the README, but attribution is required for derivative works and results.

Limitations & Caveats

Because the project is under active development, users are advised to back up their configuration files. The NSF-HiFiGAN vocoder's CC BY-NC-SA 4.0 license may restrict commercial use or redistribution of derived models.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
