DeepXi by anicolson

TensorFlow/Keras code for a priori SNR estimation (speech enhancement, robust ASR)

created 7 years ago

513 stars

Top 61.8% on sourcepulse

Project Summary

Deep Xi is a TensorFlow 2/Keras framework for estimating the a priori signal-to-noise ratio (SNR) for speech enhancement and robust automatic speech recognition (ASR). It targets researchers and engineers in audio processing and speech technology, and uses deep learning to improve speech quality and intelligibility.

How It Works

Deep Xi uses deep neural networks to predict a mapped version of the a priori SNR from the short-time magnitude spectrum of the noisy speech. The mapping is the cumulative distribution function (CDF) of the instantaneous a priori SNR, with parameters computed from training-data statistics; it constrains the training target to the interval [0, 1], which improves convergence. During inference, the a priori SNR estimate is recovered from the network output by inverting the mapping with the same sample statistics. The estimate can then be plugged into various speech processing pipelines, including MMSE-based enhancement and mask estimation.
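The mapping and its inverse can be illustrated with a minimal sketch. It assumes the CDF is that of a normal distribution fitted to the instantaneous a priori SNR in dB, with one mean and standard deviation per frequency bin; that choice, the function names, and the toy statistics are assumptions made for illustration, not the repository's code.

```python
import math
from statistics import NormalDist, mean, stdev


def instantaneous_snr_db(clean_mag, noise_mag, floor=1e-12):
    """Instantaneous a priori SNR (in dB) of one time-frequency bin."""
    xi = clean_mag ** 2 / max(noise_mag ** 2, floor)
    return 10.0 * math.log10(max(xi, floor))


def map_snr(xi_db, mu, sigma):
    """Training target: CDF of the dB-domain SNR, a value in [0, 1]."""
    return NormalDist(mu, sigma).cdf(xi_db)


def unmap_snr(xi_bar, mu, sigma, eps=1e-6):
    """Inference: invert the CDF and return the SNR in the linear domain."""
    # Clip so that the inverse CDF is defined (it requires 0 < p < 1).
    p = min(max(xi_bar, eps), 1.0 - eps)
    xi_db = NormalDist(mu, sigma).inv_cdf(p)
    return 10.0 ** (xi_db / 10.0)


# Toy "training set" of dB-domain SNR values for a single frequency bin.
train_db = [-5.0, 0.0, 5.0, 10.0, 15.0]
mu, sigma = mean(train_db), stdev(train_db)

target = map_snr(3.0, mu, sigma)         # what the network would be trained on
recovered = unmap_snr(target, mu, sigma)  # linear-domain SNR at inference
```

Because the statistics are computed once from training data, the same `mu` and `sigma` have to be stored with the model and reused when inverting the mapping.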

Quick Start & Requirements

Install via pip install -r requirements.txt after cloning the repository.
GPU usage requires CUDA 10.1 and cuDNN >= 7.6.
A Docker image is available on Docker Hub.
Configuration and execution are managed via run.sh.
Official datasets are available on IEEE DataPort.

Highlighted Details

Supports multiple network architectures: MHANet (Multi-head attention), RDLNet (Residual-dense lattice), ResNet, ResLSTM, and ResBiLSTM.
Reports state-of-the-art results on the DEMAND Voicebank and Deep Xi test sets across several objective metrics (CSIG, CBAK, COVL, PESQ, STOI).
Offers both causal and non-causal model variants.
Can be used for noise PSD estimation via the DeepMMSE component.
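To show how an a priori SNR estimate feeds into enhancement or mask estimation, here is a short sketch using two textbook gain functions, the Wiener filter and its square-root variant. These are standard formulas chosen for illustration; the function names are hypothetical, and the sketch does not claim to reproduce the gain functions or the DeepMMSE noise PSD estimator shipped with the project.

```python
import math


def wiener_gain(xi):
    """Wiener filter gain for a linear-domain a priori SNR xi."""
    return xi / (1.0 + xi)


def sqrt_wiener_gain(xi):
    """Square-root Wiener gain, often used as a ratio-mask style gain."""
    return math.sqrt(xi / (1.0 + xi))


def enhance_frame(noisy_mag, xi_hat, gain_fn=wiener_gain):
    """Apply a per-bin gain to one frame of the noisy magnitude spectrum."""
    return [gain_fn(xi) * x for x, xi in zip(noisy_mag, xi_hat)]


# One toy frame with two frequency bins and their estimated a priori SNRs.
enhanced = enhance_frame([2.0, 4.0], [1.0, 3.0])
```

A bin with a high estimated SNR keeps most of its magnitude, while a bin dominated by noise is attenuated towards zero.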

Maintenance & Community

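The project is associated with multiple research papers, which indicates academic backing, and links to the relevant papers and datasets are provided. The last commit was 3 years ago and the repository is marked inactive, so active development should not be assumed.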

Licensing & Compatibility

The repository does not explicitly state a license. The accompanying research papers and datasets suggest a focus on academic use; anyone planning commercial use should first clarify the licensing terms with the author.

Limitations & Caveats

The ResLSTM network is noted as performing below expectations relative to the TensorFlow 1.x implementation. The project defaults to single-channel audio at a 16 kHz sampling frequency, though both can be configured.

Health Check

Last commit: 3 years ago

Responsiveness: Inactive

Star History

6 stars in the last 90 days