DTLN by breizhn

Speech enhancement model using a dual-signal transformation LSTM network

Created 5 years ago
681 stars

Top 49.9% on SourcePulse

Project Summary

This repository provides a TensorFlow 2.x implementation of the Dual-signal Transformation LSTM Network (DTLN) for real-time speech denoising. It's designed for researchers and developers working on audio processing, noise suppression, and embedded systems, offering competitive performance with a small model footprint suitable for devices like the Raspberry Pi.

How It Works

DTLN combines a Short-Time Fourier Transform (STFT) with a learned analysis and synthesis basis in a stacked LSTM network. This lets the model extract information from magnitude spectra while incorporating phase information from the learned basis, and it achieves state-of-the-art noise suppression with fewer than one million parameters. The model was trained on 500 hours of noisy speech from the DNS-Challenge and runs in real time with low latency.
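
The magnitude-and-phase idea behind the STFT stage can be illustrated with a small, standard-library-only sketch. It is not the repository's code: the naive `dft`/`idft` helpers, the 8-sample frame, and the fixed masks are all assumptions made for illustration, and a placeholder mask stands in for whatever the LSTM layers would predict. The learned-basis stage follows the same mask-and-reconstruct pattern, with trained matrices in place of the Fourier transform.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n)
             for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def apply_magnitude_mask(frame, mask):
    """Scale each bin's magnitude by mask[k] and reuse the noisy phase."""
    spectrum = dft(frame)
    masked = [
        cmath.rect(abs(c) * m, cmath.phase(c))  # new magnitude, original phase
        for c, m in zip(spectrum, mask)
    ]
    return idft(masked)


# Toy frame (hypothetical data); uniform masks keep the result real-valued.
frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
unchanged = apply_magnitude_mask(frame, [1.0] * len(frame))
halved = apply_magnitude_mask(frame, [0.5] * len(frame))
```

A mask of all ones returns the input frame, and a mask of 0.5 halves it; a trained network would instead output a different value per frequency bin and per frame.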

Quick Start & Requirements

  • Install: Use provided conda environment files (train_env.yml, eval_env.yml, tflite_env.yml).
  • Prerequisites: TensorFlow 2.x, librosa, and wavinfo. For training, a GPU with at least 5 GB of VRAM, CUDA 10.1+, and NVIDIA driver 418+ are recommended.
  • Evaluation: python run_evaluation.py -i <input_folder> -o <output_folder> -m ./pretrained_model/model.h5
  • Docs: DNS-Challenge, TF-lite runtime, keras2onnx
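
To run the evaluation call above from a Python script, for example over several folders, one option is a thin `subprocess` wrapper. This is a sketch under assumptions: only the script name, the `-i`/`-o`/`-m` flags, and the model path come from the command quoted above, while the helper names and the folder names `noisy_wavs` and `enhanced_wavs` are hypothetical. It also assumes the working directory is the repository root.

```python
import subprocess
import sys


def build_evaluation_command(input_dir, output_dir,
                             model_path="./pretrained_model/model.h5"):
    """Return the argument list for the evaluation call quoted above."""
    return [
        sys.executable, "run_evaluation.py",
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-m", str(model_path),
    ]


def run_evaluation(input_dir, output_dir, **kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    cmd = build_evaluation_command(input_dir, output_dir, **kwargs)
    return subprocess.run(cmd, check=True)


# Hypothetical folder names; nothing is executed until run_evaluation is called.
cmd = build_evaluation_command("noisy_wavs", "enhanced_wavs")
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems with folder names that contain spaces.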

Highlighted Details

  • Achieves a PESQ of 3.04, a STOI of 94.76%, and an SI-SDR of 16.34 dB on the DNS-Challenge non-reverberant test set.
  • Real-time processing demonstrated on Raspberry Pi 3 B+ with TF-lite quantized models achieving 2.2 ms execution time.
  • Supports SavedModel, TF-lite, and ONNX formats for deployment.
  • Model can be trained on as little as 40 hours of data with augmentation.
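
Whether 2.2 ms counts as real time depends on how often a block must be processed. The arithmetic below is a rough check, assuming the 2.2 ms figure is the time to process one block and that one block is processed per 8 ms shift (the sampling rate and block parameters are the fixed values listed under Limitations & Caveats). The variable names are illustrative only.

```python
SAMPLE_RATE_HZ = 16_000   # fixed sampling rate
BLOCK_LEN_MS = 32         # block length
BLOCK_SHIFT_MS = 8        # block shift: a new block is due every 8 ms
EXEC_TIME_MS = 2.2        # assumed to be the per-block execution time

block_len_samples = SAMPLE_RATE_HZ * BLOCK_LEN_MS // 1000      # 512 samples
block_shift_samples = SAMPLE_RATE_HZ * BLOCK_SHIFT_MS // 1000  # 128 samples
overlap = 1 - block_shift_samples / block_len_samples          # 0.75

# Real-time factor: processing time divided by the audio time it covers.
real_time_factor = EXEC_TIME_MS / BLOCK_SHIFT_MS
headroom_ms = BLOCK_SHIFT_MS - EXEC_TIME_MS
is_real_time = real_time_factor < 1.0

print(f"block: {block_len_samples} samples, shift: {block_shift_samples} samples")
print(f"real-time factor: {real_time_factor:.3f}, headroom: {headroom_ms:.1f} ms")
```

Under these assumptions the real-time factor is about 0.275, which leaves roughly 5.8 ms per block for audio I/O and other work.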

Maintenance & Community

  • Developed by Nils L. Westhausen (Carl von Ossietzky University of Oldenburg).
  • Actively seeking user projects and feedback.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • TF-lite and ONNX conversions require splitting the model due to LSTM state handling and lack of complex number support, necessitating external state management.
  • ONNX conversion is not supported on macOS.
  • Fixed sampling rate (16 kHz) and block parameters (32 ms block length, 8 ms shift); retraining is required to change these.
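
The fixed block parameters and the external state handling mentioned above can be sketched as a block-wise streaming loop. This is an illustrative sketch, not the repository's real-time script: `placeholder_model` is a hypothetical stand-in for one inference call on a converted model, and its `state` is a simple call counter rather than real LSTM states. The point is only that the caller owns the buffers and the state and passes the state back in on every call.

```python
import math

BLOCK_LEN = 512    # 32 ms at 16 kHz
BLOCK_SHIFT = 128  # 8 ms at 16 kHz


def placeholder_model(block, state):
    """Hypothetical stand-in for one inference call.

    Scales by BLOCK_SHIFT / BLOCK_LEN so that overlap-adding the four
    overlapping blocks sums back to unit gain. `state` only counts calls.
    """
    gain = BLOCK_SHIFT / BLOCK_LEN
    return [gain * x for x in block], state + 1


def stream(audio):
    """Process `audio` (a list of floats) one block shift at a time."""
    in_buffer = [0.0] * BLOCK_LEN
    out_buffer = [0.0] * BLOCK_LEN
    state = 0  # kept outside the model and fed back on every call
    output = []
    for start in range(0, len(audio) - BLOCK_SHIFT + 1, BLOCK_SHIFT):
        # Shift the newest BLOCK_SHIFT samples into the input buffer.
        in_buffer = in_buffer[BLOCK_SHIFT:] + audio[start:start + BLOCK_SHIFT]
        out_block, state = placeholder_model(in_buffer, state)
        # Shift the output buffer, then overlap-add the new block.
        out_buffer = out_buffer[BLOCK_SHIFT:] + [0.0] * BLOCK_SHIFT
        out_buffer = [a + b for a, b in zip(out_buffer, out_block)]
        output.extend(out_buffer[:BLOCK_SHIFT])
    return output, state


audio = [math.sin(0.01 * n) for n in range(1024)]  # hypothetical test signal
enhanced, calls = stream(audio)
delay = BLOCK_LEN - BLOCK_SHIFT  # output trails the input by 384 samples here
```

With the pass-through placeholder, `enhanced` is the input delayed by `BLOCK_LEN - BLOCK_SHIFT` samples, which makes the buffering easy to verify before swapping in a real model call.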

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 30 days
