DTLN by breizhn

Speech enhancement model using a dual-signal transformation LSTM network

Created 5 years ago

681 stars

Top 49.9% on SourcePulse

Project Summary

This repository provides a TensorFlow 2.x implementation of the Dual-signal Transformation LSTM Network (DTLN) for real-time speech denoising. It's designed for researchers and developers working on audio processing, noise suppression, and embedded systems, offering competitive performance with a small model footprint suitable for devices like the Raspberry Pi.

How It Works

DTLN combines a Short-Time Fourier Transform (STFT) with a learned analysis and synthesis basis in a stacked LSTM network with fewer than one million parameters. Combining the two signal transformations lets the model extract information from magnitude spectra and incorporate phase information from the learned basis, which yields competitive noise suppression at a small model size. The pretrained model was trained on 500 hours of noisy speech from the DNS-Challenge, and the network runs in real time with low latency, processing one block at a time.
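The magnitude-masking idea behind the STFT stage can be sketched as follows. This is a minimal illustration, not the repository's code: the function name is hypothetical, the mask is a fixed placeholder where the real network would supply an LSTM-predicted mask, and the 512-sample frame assumes the 32 ms block at 16 kHz listed under Limitations & Caveats.

```python
import numpy as np

BLOCK_LEN = 512  # assumed: 32 ms at 16 kHz


def apply_magnitude_mask(frame, mask):
    """Scale the magnitude spectrum of one frame by `mask` and reuse the noisy phase."""
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    masked_spectrum = (magnitude * mask) * np.exp(1j * phase)
    # Back to the time domain; in DTLN a second, learned-basis stage would follow.
    return np.fft.irfft(masked_spectrum, n=len(frame))


rng = np.random.default_rng(0)
frame = rng.standard_normal(BLOCK_LEN)

# Placeholder masks (hypothetical): all-ones passes the frame through,
# a constant 0.5 halves every frequency bin.
identity_mask = np.ones(BLOCK_LEN // 2 + 1)
restored = apply_magnitude_mask(frame, identity_mask)
halved = apply_magnitude_mask(frame, 0.5 * identity_mask)
```

Because the phase is taken from the noisy input, a mask can only change how strongly each frequency bin contributes; that is why the second stage, which works on a learned basis instead of the STFT, is the one that can bring in phase information.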

Quick Start & Requirements

Install: Use provided conda environment files (train_env.yml, eval_env.yml, tflite_env.yml).
Prerequisites: TensorFlow 2.x, librosa, wavinfo. For training, a GPU with at least 5 GB of VRAM, CUDA 10.1+, and NVIDIA driver 418+ are recommended.
Evaluation: python run_evaluation.py -i <input_folder> -o <output_folder> -m ./pretrained_model/model.h5
Docs: DNS-Challenge, TF-lite runtime, keras2onnx
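Since the model is fixed to a 16 kHz sampling rate, it can help to know which files in an input folder use a different rate before running the evaluation. The following is a minimal sketch using only Python's standard-library wave module; the helper names are hypothetical and not part of the repository, and whether the evaluation script resamples such files on its own is not stated here.

```python
import tempfile
import wave
from pathlib import Path

TARGET_RATE = 16000  # Hz, the model's fixed sampling rate


def find_off_rate_wavs(folder, target_rate=TARGET_RATE):
    """Return (file name, sample rate) for each WAV file not at target_rate."""
    mismatched = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target_rate:
            mismatched.append((path.name, rate))
    return mismatched


def _write_silence(path, rate, n_frames=160):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


# Demo on a temporary folder with one 16 kHz file and one 44.1 kHz file.
with tempfile.TemporaryDirectory() as tmp:
    _write_silence(Path(tmp) / "a_16k.wav", 16000)
    _write_silence(Path(tmp) / "b_44k.wav", 44100)
    demo_result = find_off_rate_wavs(tmp)
```

Files reported by such a check would need resampling to 16 kHz by a tool of your choice before the results can be compared fairly.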

Highlighted Details

Achieves a PESQ of 3.04, a STOI of 94.76%, and an SI-SDR of 16.34 dB on the DNS-Challenge non-reverberant test set.
Real-time processing demonstrated on a Raspberry Pi 3 B+: the quantized TF-lite model needs 2.2 ms of execution time per block, well within the 8 ms block shift.
Supports SavedModel, TF-lite, and ONNX formats for deployment.
Model can be trained on as little as 40 hours of data with augmentation.
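For readers unfamiliar with SI-SDR (scale-invariant signal-to-distortion ratio), the metric can be sketched in a few lines of NumPy. This follows the common definition with mean removal; the function is an illustration, and the exact evaluation procedure behind the figures reported above is not described here.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a clean reference."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to find the scaled target part.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


# Demo: a 100 Hz sine plus an orthogonal cosine at one tenth of its amplitude.
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 100 * t)
interference = np.cos(2 * np.pi * 100 * t)
estimate = reference + 0.1 * interference
score = si_sdr(estimate, reference)  # approximately 20 dB
```

Rescaling the estimate leaves the score unchanged, which is the point of the scale-invariant variant: a model cannot improve its score simply by changing the output gain.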

Maintenance & Community

Developed by Nils L. Westhausen (Carl von Ossietzky University of Oldenburg).
The author invites users to share projects built with DTLN and to send feedback.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

TF-lite and ONNX conversions require splitting the model, owing to LSTM state handling and the lack of complex-number support; the caller must then manage the LSTM states externally.
ONNX conversion is not supported on macOS.
Fixed sampling rate (16 kHz) and block parameters (32 ms block length, 8 ms shift); retraining is required to change these.
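The fixed block parameters and the external state handling can be illustrated with a block-wise processing loop. This is a sketch under stated assumptions, not the repository's implementation: passthrough_model is a hypothetical stand-in for an inference call that takes and returns its state explicitly, and the state is a plain counter, not LSTM tensors. At 16 kHz, a 32 ms block is 512 samples and an 8 ms shift is 128 samples.

```python
import numpy as np

SAMPLE_RATE = 16000
BLOCK_LEN = 512    # 32 ms * 16 kHz
BLOCK_SHIFT = 128  # 8 ms * 16 kHz


def passthrough_model(block, state):
    """Hypothetical stand-in for one inference call with explicit state.

    Dividing by the overlap factor makes the overlap-add sum reproduce the input.
    """
    return block / (BLOCK_LEN // BLOCK_SHIFT), state + 1


def process_stream(audio, model_step, initial_state):
    """Feed audio in BLOCK_SHIFT-sized steps and overlap-add the model output."""
    in_buffer = np.zeros(BLOCK_LEN)
    out_buffer = np.zeros(BLOCK_LEN)
    state = initial_state
    n_blocks = len(audio) // BLOCK_SHIFT
    output = np.zeros(n_blocks * BLOCK_SHIFT)
    for idx in range(n_blocks):
        start = idx * BLOCK_SHIFT
        # Shift the input buffer left and append the newest samples.
        in_buffer[:-BLOCK_SHIFT] = in_buffer[BLOCK_SHIFT:]
        in_buffer[-BLOCK_SHIFT:] = audio[start:start + BLOCK_SHIFT]
        # The caller owns the state and passes it in on every call.
        out_block, state = model_step(in_buffer, state)
        # Shift the output buffer, clear its tail, then overlap-add.
        out_buffer[:-BLOCK_SHIFT] = out_buffer[BLOCK_SHIFT:]
        out_buffer[-BLOCK_SHIFT:] = 0.0
        out_buffer += out_block
        output[start:start + BLOCK_SHIFT] = out_buffer[:BLOCK_SHIFT]
    return output, state


rng = np.random.default_rng(0)
audio = rng.standard_normal(SAMPLE_RATE)  # one second of noise
output, final_state = process_stream(audio, passthrough_model, initial_state=0)
delay = BLOCK_LEN - BLOCK_SHIFT  # 384 samples, i.e. 24 ms of buffering delay
```

With the pass-through stand-in, the output equals the input delayed by 384 samples. With a split TF-lite or ONNX model, the state passed between calls would be the LSTM state tensors of each sub-model.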

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

11 stars in the last 30 days

