HiFTNet by yl4579

Fast, high-quality neural vocoder for speech synthesis

Created 2 years ago
251 stars

Top 99.8% on SourcePulse

Summary

HiFTNet is a neural vocoder designed for fast, high-quality speech synthesis from mel-spectrograms. It addresses the computational cost and large parameter counts of prior GAN-based models such as HiFi-GAN and BigVGAN. It is aimed at researchers and developers in speech synthesis: it reports quality comparable to BigVGAN while running 4x faster with 1/6 the parameters, and ground-truth-level quality on LJSpeech, which makes it a candidate for real-time applications.

How It Works

HiFTNet extends the iSTFTNet architecture by incorporating a novel harmonic-plus-noise source filter operating in the time-frequency domain. This filter leverages a sinusoidal source derived from a fundamental frequency (F0) estimated by a pre-trained network. This design choice allows for rapid inference, significantly reducing computational load and model size compared to traditional GAN vocoders, while maintaining high fidelity.
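
As a rough illustration of the idea of a sinusoidal source driven by F0, the sketch below builds a time-domain harmonic-plus-noise excitation from frame-level F0 values. It is not taken from the HiFTNet code: the function name, the parameter defaults (hop size, sample rate, number of harmonics, amplitudes) and the unvoiced-frame noise level are all assumptions chosen for illustration, and the step that moves the source into the time-frequency domain (for example via an STFT) is left out.

```python
import math
import random


def harmonic_plus_noise_source(f0_frames, hop_size=256, sample_rate=22050,
                               num_harmonics=4, sine_amp=0.1, noise_std=0.003,
                               seed=0):
    """Build a per-sample excitation from frame-level F0 (Hz, 0 = unvoiced).

    Hypothetical helper for illustration; all defaults are assumptions,
    not values from the HiFTNet repository.
    """
    rng = random.Random(seed)
    phase = 0.0  # running phase of the fundamental, in radians
    samples = []
    for f0 in f0_frames:
        voiced = f0 > 0.0
        for _ in range(hop_size):
            if voiced:
                # Advance the fundamental's phase by one sample.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                # Sum the harmonics that stay below the Nyquist frequency.
                harmonic_sum = sum(
                    math.sin(k * phase)
                    for k in range(1, num_harmonics + 1)
                    if k * f0 < sample_rate / 2
                )
                value = sine_amp * harmonic_sum + rng.gauss(0.0, noise_std)
            else:
                # Unvoiced frames carry noise only (level is an assumption).
                value = rng.gauss(0.0, sine_amp / 3.0)
            samples.append(value)
    return samples


if __name__ == "__main__":
    # One unvoiced frame followed by two voiced frames at 220 Hz.
    source = harmonic_plus_noise_source([0.0, 220.0, 220.0])
    print(len(source))
```

Each F0 frame is held constant for `hop_size` samples here; a real implementation would more likely interpolate F0 to the sample rate before accumulating phase, so treat this as a sketch of the principle rather than of the model.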

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/yl4579/HiFTNet.git), navigate into the directory, and install Python requirements (pip install -r requirements.txt).
  • Prerequisites: Python >= 3.7. A pre-trained F0 estimation model is available from yl4579/PitchExtractor.
  • Resources: Pre-trained models for LJSpeech and LibriTTS are provided.
  • Documentation: Inference details are available in inference.ipynb. Audio samples can be found at https://hiftnet.github.io/. The research paper is available at https://arxiv.org/abs/2309.09493.

Highlighted Details

  • Achieves ground-truth-level performance on LJSpeech, outperforming iSTFTNet and HiFi-GAN.
  • Outperforms BigVGAN-base on LibriTTS for unseen speakers.
  • Delivers comparable quality to BigVGAN but is 4x faster and uses 1/6 the parameters.
  • Establishes a new benchmark for efficient, high-quality neural vocoding.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps are present in the provided README.

Licensing & Compatibility

The README does not specify a software license. This absence requires clarification for any adoption decision, particularly concerning commercial use or integration into proprietary systems.

Limitations & Caveats

The vocoder's performance is critically dependent on the accuracy of the fundamental frequency (F0) estimation. For optimal results, especially with noisy audio or non-speech content, training a dedicated F0 model is recommended.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days
