HiFTNet by yl4579

Fast, high-quality neural vocoder for speech synthesis

Created 2 years ago

254 stars

Top 99.1% on SourcePulse

1 Expert Loves This Project

Andreas Jansson

Cofounder of Replicate

Project Summary

HiFTNet is a neural vocoder for fast, high-quality speech synthesis from mel-spectrograms. It addresses the computational cost and parameter count of earlier GAN-based vocoders such as HiFi-GAN and BigVGAN. Aimed at speech synthesis researchers and developers, it reaches ground-truth-level quality on LJSpeech and quality comparable to BigVGAN while running four times faster with one-sixth of the parameters, which makes real-time applications practical.

How It Works

HiFTNet extends the iSTFTNet architecture with a harmonic-plus-noise source filter that operates in the time-frequency domain. The filter takes a sinusoidal source built from the fundamental frequency (F0), which a pre-trained network estimates. This design keeps inference fast and reduces both computational load and model size relative to earlier GAN vocoders, while maintaining high fidelity.
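To make the idea of a sinusoidal source derived from F0 concrete, here is a minimal sketch in plain Python. It is not the repository's code. The function name, parameters, and default values are all hypothetical, and it only builds a time-domain harmonic-plus-noise excitation. The step that moves this source into the time-frequency domain and filters it is not shown.

```python
import math
import random


def harmonic_plus_noise_source(f0_frames, sample_rate=22050, hop_size=256,
                               num_harmonics=8, voiced_noise_std=0.003,
                               unvoiced_noise_std=0.03, seed=0):
    """Build an excitation signal from frame-level F0 values in Hz.

    Hypothetical helper for illustration only. A frame with F0 == 0 is
    treated as unvoiced and produces noise alone.
    """
    rng = random.Random(seed)
    nyquist = sample_rate / 2.0
    phase = 0.0  # running phase of the fundamental, in radians
    samples = []
    for f0 in f0_frames:
        # Hold each frame's F0 constant for hop_size output samples.
        for _ in range(hop_size):
            if f0 > 0:
                # Accumulate phase so the sinusoid stays continuous across frames.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                # Sum the harmonics k * f0 that lie below the Nyquist frequency.
                harmonics = sum(
                    math.sin(k * phase)
                    for k in range(1, num_harmonics + 1)
                    if k * f0 < nyquist
                )
                value = harmonics / num_harmonics + rng.gauss(0.0, voiced_noise_std)
            else:
                value = rng.gauss(0.0, unvoiced_noise_std)
            samples.append(value)
    return samples


# Three frames: one unvoiced, then two voiced at 220 Hz.
source = harmonic_plus_noise_source([0.0, 220.0, 220.0])
print(len(source))  # 3 frames * 256 samples per frame = 768
```

Because the excitation is computed directly from F0, any error in the F0 estimate carries over to every harmonic of the source.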

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/yl4579/HiFTNet.git), navigate into the directory, and install Python requirements (pip install -r requirements.txt).
Prerequisites: Python >= 3.7. A pre-trained F0 estimation model is available from yl4579/PitchExtractor.
Resources: Pre-trained models for LJSpeech and LibriTTS are provided.
Documentation: Inference details are available in inference.ipynb. Audio samples can be found at https://hiftnet.github.io/. The research paper is available at https://arxiv.org/abs/2309.09493.

Highlighted Details

Achieves ground-truth-level performance on LJSpeech, outperforming iSTFTNet and HiFi-GAN.
Outperforms BigVGAN-base on LibriTTS for unseen speakers.
Delivers quality comparable to BigVGAN while running four times faster with one-sixth of the parameters.
Establishes a new benchmark for efficient, high-quality neural vocoding.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps are present in the provided README.

Licensing & Compatibility

The README does not specify a software license. This absence requires clarification for any adoption decision, particularly concerning commercial use or integration into proprietary systems.

Limitations & Caveats

The vocoder's performance is critically dependent on the accuracy of the fundamental frequency (F0) estimation. For optimal results, especially with noisy audio or non-speech content, training a dedicated F0 model is recommended.

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

2 stars in the last 30 days