DeepXi  by anicolson

TensorFlow/Keras code for a priori SNR estimation (speech enhancement, robust ASR)

created 7 years ago
513 stars

Top 61.8% on sourcepulse

Project Summary

Deep Xi is a TensorFlow 2/Keras framework for estimating the a priori signal-to-noise ratio (SNR) for speech enhancement and robust automatic speech recognition (ASR). It targets researchers and engineers in audio processing and speech technology, offering a deep learning approach to improving speech quality and intelligibility.

How It Works

Deep Xi uses deep neural networks to predict a mapped version of the a priori SNR from the short-time magnitude spectrum of the noisy speech. The mapping is the cumulative distribution function (CDF) of the instantaneous a priori SNR, with its parameters computed from training data statistics; it compresses the target into a bounded range, which improves convergence. During inference, the a priori SNR estimate is recovered by inverting the mapping with the same sample statistics. The estimate can then be integrated into various speech processing pipelines, including MMSE-based enhancement and mask estimation.
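The mapping and its inverse can be sketched in a few lines. This is a minimal illustration, not the repository's code: it assumes the CDF is a per-frequency-bin Gaussian over the SNR in dB, and the function names (map_snr, unmap_snr) and the mu_db / sigma_db statistics are hypothetical placeholders for values that would be computed from training data.

```python
import math
from statistics import NormalDist

def map_snr(xi, mu_db, sigma_db, floor=1e-12):
    """Map linear a priori SNR values (one per frequency bin) into (0, 1).

    Assumption: each bin uses a Gaussian CDF over the SNR in dB, with
    mean mu_db[k] and standard deviation sigma_db[k].
    """
    mapped = []
    for x, m, s in zip(xi, mu_db, sigma_db):
        x_db = 10.0 * math.log10(max(x, floor))  # guard against log10(0)
        mapped.append(NormalDist(m, s).cdf(x_db))
    return mapped

def unmap_snr(xi_bar, mu_db, sigma_db, eps=1e-12):
    """Invert map_snr: recover linear a priori SNR from mapped values."""
    recovered = []
    for p, m, s in zip(xi_bar, mu_db, sigma_db):
        p = min(max(p, eps), 1.0 - eps)  # inv_cdf is undefined at 0 and 1
        x_db = NormalDist(m, s).inv_cdf(p)
        recovered.append(10.0 ** (x_db / 10.0))
    return recovered
```

In a setup like this, a network would be trained to output the mapped values, and unmap_snr would be applied to its predictions at inference time using the same per-bin statistics.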

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • GPU usage requires CUDA 10.1 and cuDNN >= 7.6.
  • A Docker image is available on Docker Hub.
  • Configuration and execution are managed via run.sh.
  • Official datasets are available on IEEE DataPort.

Highlighted Details

  • Supports multiple network architectures: MHANet (Multi-head attention), RDLNet (Residual-dense lattice), ResNet, ResLSTM, and ResBiLSTM.
  • Reported state-of-the-art results on the DEMAND Voicebank and Deep Xi test sets across objective metrics (CSIG, CBAK, COVL, PESQ, STOI).
  • Offers both causal and non-causal model variants.
  • Can be used for noise PSD estimation via the DeepMMSE component.
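As a rough illustration of how an a priori SNR estimate feeds enhancement and noise PSD estimation, the sketch below applies a Wiener gain and a simple noise power estimate per frequency bin. It is not the repository's DeepMMSE implementation: the function names are hypothetical, and the noise estimate assumes speech and noise are uncorrelated, so that the noisy-speech power equals the noise power times (1 + xi), with the instantaneous periodogram standing in for the expected noisy-speech power.

```python
def wiener_gain(xi):
    """Wiener filter gain for a linear a priori SNR value xi >= 0."""
    return xi / (1.0 + xi)

def enhance_magnitudes(noisy_mag, xi_hat):
    """Scale each noisy magnitude bin by the Wiener gain for that bin."""
    return [wiener_gain(x) * m for m, x in zip(noisy_mag, xi_hat)]

def estimate_noise_psd(noisy_mag, xi_hat):
    """Estimate noise power per bin as |X|^2 / (1 + xi).

    Assumption: speech and noise are uncorrelated, and the instantaneous
    periodogram |X|^2 approximates the expected noisy-speech power.
    """
    return [(m * m) / (1.0 + x) for m, x in zip(noisy_mag, xi_hat)]
```

Other gain functions (for example MMSE short-time spectral amplitude estimators) take the same a priori SNR input, so swapping wiener_gain for another function is the only change such a pipeline would need.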

Maintenance & Community

The project is associated with multiple research papers, which indicates academic backing. Links to the relevant papers and datasets are provided.

Licensing & Compatibility

The repository does not explicitly state a license. However, the inclusion of research papers and datasets suggests a focus on academic use. Commercial use would require clarification of licensing terms.

Limitations & Caveats

The ResLSTM network is noted as underperforming its TensorFlow 1.x implementation. The project primarily targets single-channel audio at a 16 kHz sampling frequency, though both can be configured.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
