onnx-asr by istupakov

Lightweight ONNX-based Automatic Speech Recognition (ASR)

Created 1 year ago

341 stars

Top 80.7% on SourcePulse

Project Summary

Summary

onnx-asr is a lightweight Python package designed for efficient Automatic Speech Recognition (ASR) using ONNX models. It targets engineers and researchers seeking a fast, easy-to-use ASR solution with minimal dependencies, deployable across diverse hardware from edge devices to servers. The package simplifies the integration of various state-of-the-art ASR models into applications without requiring heavy deep learning frameworks.

How It Works

The package runs inference through ONNX Runtime, so applications do not need heavy deep learning frameworks such as PyTorch or Transformers. It supports a range of ONNX-exported ASR architectures, including NeMo, Kaldi, Vosk, GigaAM, and Whisper, and supplies the preprocessors and decoders each one needs. This design gives cross-platform compatibility and efficient execution on CPUs and on hardware accelerators such as GPUs.
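ONNX Runtime selects hardware through named execution providers, and a caller usually passes an ordered list with CPU as the final fallback. The sketch below shows one way to build that list. The provider names are the standard ONNX Runtime identifiers; the preference order and the helper names `pick_providers` and `available_providers` are assumptions made for illustration, not part of onnx-asr.

```python
# Standard ONNX Runtime execution-provider names.
# The ordering (fastest accelerator first) is an assumption for this sketch.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in preference order.

    CPU is always appended as a last resort so the list is never empty.
    """
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def available_providers():
    """Ask the installed ONNX Runtime build which providers it ships with."""
    import onnxruntime as ort  # imported lazily so the helper above works without it

    return ort.get_available_providers()
```

With ONNX Runtime installed, `pick_providers(available_providers())` yields a list that can be passed as the `providers` argument of `onnxruntime.InferenceSession`.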

Quick Start & Requirements

Install with pip: pip install onnx-asr[cpu,hub] for CPU, or pip install onnx-asr[gpu,hub] for GPU acceleration. GPU use requires a compatible CUDA/TensorRT setup and may also require pip install onnxruntime-gpu[cuda,cudnn] tensorrt-cu12-libs. The package supports Python 3.10-3.14 and NumPy 1.22.4-2.4+. A demo is available on Hugging Face Spaces.
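A minimal sketch of what a first script might look like after installation. The version check uses the Python range stated above. The `transcribe` helper is hypothetical, and the calls `onnx_asr.load_model` and `model.recognize` are assumptions based on the project README rather than anything confirmed in this summary, so check them against the current documentation.

```python
import sys

# Supported interpreter range as stated in this summary (3.10 through 3.14).
SUPPORTED_PYTHON = ((3, 10), (3, 14))


def python_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    low, high = SUPPORTED_PYTHON
    return low <= major_minor <= high


def transcribe(wav_path, model_name):
    """Hypothetical helper: load a model by name and transcribe one WAV file.

    Assumes the package exposes load_model() and a recognize() method,
    as recalled from the project README.
    """
    import onnx_asr  # imported lazily; requires `pip install onnx-asr[cpu,hub]`

    model = onnx_asr.load_model(model_name)
    return model.recognize(wav_path)
```

A call would then look like `transcribe("test.wav", "<model name>")`, where the model name is one of the identifiers listed in the project README.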

Highlighted Details

onnx-asr runs on x86 and Arm CPUs and can accelerate inference with CUDA, TensorRT, CoreML, ROCm, and DirectML. It supports batch processing, handles long-form audio via Voice Activity Detection (VAD), and can return token-level timestamps and log probabilities. Quantized models are supported for improved performance. A simple CLI and a Gradio web interface are also provided.
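Token-level log probabilities are often reduced to a confidence score, and timestamps make it possible to point at the uncertain parts of the audio. The sketch below shows one common way to do both. It works on plain lists, which is an assumption: the function names and the input layout are hypothetical and do not mirror the package's actual result type.

```python
import math


def utterance_confidence(token_logprobs):
    """Geometric mean of token probabilities, i.e. exp(mean log-probability)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def flag_low_confidence(tokens, timestamps, logprobs, threshold=0.5):
    """Return (token, time) pairs whose probability falls below the threshold."""
    return [
        (token, time)
        for token, time, logprob in zip(tokens, timestamps, logprobs)
        if math.exp(logprob) < threshold
    ]
```

The geometric mean is used rather than the product of probabilities so that long utterances are not penalized simply for containing more tokens.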

Maintenance & Community

The README does not list notable contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

The project is released under the permissive MIT License, which allows commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

A known issue affects onnxruntime version 1.24 when the Hugging Face cache uses symlinks; users may need to install an older version or specify a download path explicitly. Most models accept only 20-30 seconds of audio at a time, so longer inputs require VAD. Only PCM variants of WAV are supported directly; other audio formats must be converted first or read with a library such as soundfile. Some older onnx-community models may have broken fp16 precision.
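To make the length limit concrete, the sketch below reads a WAV file's duration with Python's standard `wave` module and computes fixed 30-second windows. This is a naive fixed-length split, not VAD: it can cut through the middle of a word, so it only illustrates the arithmetic of staying under the limit. The function names and the 30-second default are assumptions for this example.

```python
import wave

MAX_SEGMENT_S = 30.0  # upper end of the 20-30 second limit mentioned above


def wav_duration(path):
    """Duration of a WAV file in seconds.

    wave.open raises wave.Error for files it cannot parse, which gives a
    cheap early check before handing the file to a recognizer.
    """
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def fixed_windows(duration_s, max_s=MAX_SEGMENT_S):
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_s."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_s, duration_s)
        windows.append((start, end))
        start = end
    return windows
```

For a 65-second recording, `fixed_windows(65.0)` produces three windows, the last one 5 seconds long. A VAD-based split would instead place the boundaries in silent stretches.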

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

14 stars in the last 30 days