TTS/SVS/SVC framework for voice generation tasks
Top 49.0% on sourcepulse
Fish Diffusion is a framework for training Text-to-Speech (TTS), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC) models. It aims to provide an easy-to-understand and modular codebase for researchers and developers working on voice generation tasks. The project leverages diffusion models for voice generation, offering multi-speaker support and compatibility with 44.1kHz vocoders.
How It Works
The framework utilizes diffusion models to generate audio, offering a simpler and more decoupled code structure compared to its predecessor, diffsvc. It supports multi-speaker training, 44.1kHz vocoders, multi-machine/multi-device training, and half-precision training for improved speed and memory efficiency.
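As a rough illustration of the diffusion idea mentioned above, the following is a minimal sketch of the generic DDPM forward (noising) process in plain Python. It is not taken from Fish Diffusion: the linear schedule, the step count of 1000, the function names, and the three-value "frame" are all assumptions chosen for illustration. A trained model learns to reverse this corruption step by step.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for one frame of features."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

# Made-up stand-in for one frame of acoustic features.
frame = [0.5, -0.2, 0.1]
slightly_noisy = add_noise(frame, 10, alpha_bars, rng)   # early step: mostly signal
mostly_noise = add_noise(frame, 999, alpha_bars, rng)    # last step: almost pure noise
```

At early steps the sample stays close to the clean frame. By the final step the signal scale is close to zero, which is why generation can start from pure noise.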
Quick Start & Requirements
Install dependencies with PDM (pdm sync).
Download the NSF-HiFiGAN vocoder (python tools/download_nsf_hifigan.py).
Place the data in a dataset directory with train and valid subfolders. Feature extraction is performed using python tools/preprocessing/extract_features.py.
Train with python tools/diffusion/train.py --config configs/svc_hubert_soft.py. Multi-node training requires PyTorch Lightning setup.
Run inference with python tools/diffusion/inference.py --config [config] --checkpoint [checkpoint file] --input [input audio] --output [output audio]. A Gradio interface is available.
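The quick-start steps can be scripted. The sketch below is a hypothetical helper, not part of Fish Diffusion: it creates the train and valid subfolders and assembles the training and inference commands as argument lists suitable for subprocess.run. The script paths and flags are the ones quoted above. The function names, the checkpoint path, and the audio file names are placeholders. Feature extraction is left out because its arguments are not given here.

```python
import shlex
import sys
from pathlib import Path


def make_dataset_dirs(root):
    """Create <root>/train and <root>/valid if missing; return both paths."""
    root = Path(root)
    subdirs = [root / "train", root / "valid"]
    for subdir in subdirs:
        subdir.mkdir(parents=True, exist_ok=True)
    return subdirs


def train_command(config, python=sys.executable):
    """Argument list for the training script quoted in the quick start."""
    return [python, "tools/diffusion/train.py", "--config", str(config)]


def inference_command(config, checkpoint, input_audio, output_audio,
                      python=sys.executable):
    """Argument list for the inference script quoted in the quick start."""
    return [
        python, "tools/diffusion/inference.py",
        "--config", str(config),
        "--checkpoint", str(checkpoint),
        "--input", str(input_audio),
        "--output", str(output_audio),
    ]


if __name__ == "__main__":
    # Print the commands rather than running them; pass a list to
    # subprocess.run(cmd, check=True) from the repository root to execute.
    config = "configs/svc_hubert_soft.py"
    print(shlex.join(train_command(config)))
    # The checkpoint and audio paths below are placeholders.
    print(shlex.join(inference_command(
        config, "checkpoints/example.ckpt", "input.wav", "output.wav")))
```

Building the commands as lists avoids shell-quoting problems when paths contain spaces. Printing them first makes it easy to check the paths before starting a long training run.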
Maintenance & Community
The project is under active development. Contributions are welcome via issues or pull requests. Linting is required before submission (pdm run lint). Real-time documentation can be generated (pdm run docs).
Licensing & Compatibility
The NSF-HiFiGAN vocoder is licensed under CC BY-NC-SA 4.0. The primary license for Fish Diffusion is not explicitly stated in the README, but attribution is required for derivative works and results.
Limitations & Caveats
The project is under active development, and users are advised to back up their configuration files. The NSF-HiFiGAN vocoder's CC BY-NC-SA 4.0 license may restrict commercial use or redistribution of derived models.