DTLN by breizhn

Speech enhancement model using a dual-signal transformation LSTM network

Created 5 years ago
645 stars

Project Summary

This repository provides a TensorFlow 2.x implementation of the Dual-signal Transformation LSTM Network (DTLN) for real-time speech denoising. It's designed for researchers and developers working on audio processing, noise suppression, and embedded systems, offering competitive performance with a small model footprint suitable for devices like the Raspberry Pi.

How It Works

DTLN stacks two LSTM-based separation stages. The first works on the magnitude of a short-time Fourier transform (STFT); the second works on a learned analysis and synthesis basis. This lets the network use magnitude spectral information together with phase information carried by the learned basis, giving competitive noise suppression with fewer than one million parameters. The model was trained on 500 hours of noisy speech from the DNS-Challenge and runs in real time with low latency.
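
The two signal transformations described above can be sketched for a single audio block as follows. This is a minimal NumPy illustration, not the repository's code: `mask_fn` is a hypothetical stand-in for the LSTM mask estimator, and the `analysis` and `synthesis` matrices are assumed placeholders for the learned basis.

```python
import numpy as np

BLOCK_LEN = 512  # 32 ms at 16 kHz


def stft_mask_stage(block, mask_fn):
    """Mask the STFT magnitude of one block and reuse the noisy phase."""
    spectrum = np.fft.rfft(block)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    mask = mask_fn(magnitude)  # hypothetical mask estimator (an LSTM in DTLN)
    enhanced = mask * magnitude * np.exp(1j * phase)
    return np.fft.irfft(enhanced, n=len(block))


def learned_basis_stage(block, analysis, synthesis, mask_fn):
    """Mask the block in a learned feature domain, then map back to time."""
    features = analysis @ block  # assumed: the learned analysis basis as a matrix
    mask = mask_fn(features)     # hypothetical mask estimator
    return synthesis @ (mask * features)


def dual_signal_transform(block, analysis, synthesis, mask_fn_1, mask_fn_2):
    """Chain both stages: STFT-domain masking, then learned-basis masking."""
    first = stft_mask_stage(block, mask_fn_1)
    return learned_basis_stage(first, analysis, synthesis, mask_fn_2)
```

With all-ones masks and identity matrices the chain returns the input unchanged, which is a quick way to check the plumbing before plugging in a real estimator.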

Quick Start & Requirements

  • Install: Use provided conda environment files (train_env.yml, eval_env.yml, tflite_env.yml).
  • Prerequisites: TensorFlow 2.x, librosa, wavinfo. For training, a GPU with 5 GB+ VRAM, CUDA 10.1+, and NVIDIA driver 418+ are recommended.
  • Evaluation: python run_evaluation.py -i <input_folder> -o <output_folder> -m ./pretrained_model/model.h5
  • Docs: DNS-Challenge, TF-lite runtime, keras2onnx

Highlighted Details

  • Achieves a PESQ of 3.04, a STOI of 94.76 %, and an SI-SDR of 16.34 dB on the DNS-Challenge non-reverberant test set.
  • Real-time processing demonstrated on a Raspberry Pi 3 B+: the quantized TF-lite model needs 2.2 ms of execution time, well under the 8 ms block shift.
  • Supports SavedModel, TF-lite, and ONNX formats for deployment.
  • Model can be trained on as little as 40 hours of data with augmentation.
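
For readers unfamiliar with SI-SDR, the sketch below computes it using the common scale-invariant definition (project the estimate onto the reference, then compare target and residual energy). It is an illustration only; the function name `si_sdr` and the `eps` guard are assumptions, and the repository's own evaluation code may differ.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled before comparison, multiplying the estimate by a constant gain leaves the score unchanged.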

Maintenance & Community

  • Developed by Nils L. Westhausen (Carl von Ossietzky University of Oldenburg).
  • Actively seeking user projects and feedback.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • TF-lite and ONNX conversions require splitting the model into two sub-models, because these formats lack complex number support and the LSTM states cannot be kept inside the model. The caller must store the LSTM states and pass them back in on every block.
  • ONNX conversion is not supported on macOS.
  • Fixed sampling rate (16 kHz) and block parameters (32 ms block length, 8 ms shift); retraining is required to change these.
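
The fixed block parameters and the external state handling can be sketched as a streaming loop. This is a minimal sketch under stated assumptions, not the repository's implementation: `model_step` is a hypothetical callable that takes one 32 ms block plus a state object and returns the enhanced block and the updated state, and the output uses plain overlap-add without a window.

```python
import numpy as np

SAMPLE_RATE = 16_000
BLOCK_LEN = SAMPLE_RATE * 32 // 1000    # 32 ms -> 512 samples
BLOCK_SHIFT = SAMPLE_RATE * 8 // 1000   # 8 ms  -> 128 samples


def process_stream(audio, model_step, state):
    """Run a block-wise model over `audio`, carrying `state` between blocks."""
    in_buffer = np.zeros(BLOCK_LEN, dtype=np.float32)
    out_buffer = np.zeros(BLOCK_LEN, dtype=np.float32)
    n_blocks = len(audio) // BLOCK_SHIFT
    output = np.zeros(n_blocks * BLOCK_SHIFT, dtype=np.float32)

    for i in range(n_blocks):
        start = i * BLOCK_SHIFT
        # Shift the input buffer left and append the newest 8 ms of audio.
        in_buffer[:-BLOCK_SHIFT] = in_buffer[BLOCK_SHIFT:]
        in_buffer[-BLOCK_SHIFT:] = audio[start:start + BLOCK_SHIFT]
        # Hypothetical model call: the state goes in and comes back out.
        enhanced, state = model_step(in_buffer, state)
        # Overlap-add: shift the output buffer, clear the tail, accumulate.
        out_buffer[:-BLOCK_SHIFT] = out_buffer[BLOCK_SHIFT:]
        out_buffer[-BLOCK_SHIFT:] = 0.0
        out_buffer += enhanced
        # The oldest 8 ms of the output buffer is now complete.
        output[start:start + BLOCK_SHIFT] = out_buffer[:BLOCK_SHIFT]

    return output, state
```

In this sketch each output sample is the sum of four overlapping blocks, so a pass-through `model_step` yields four times the input delayed by 24 ms; a trained model would be expected to produce blocks that already sum correctly.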

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
