onnx-asr by istupakov

Lightweight ONNX-based Automatic Speech Recognition (ASR)

Created 1 year ago
294 stars

Top 90.0% on SourcePulse

Summary

onnx-asr is a lightweight Python package designed for efficient Automatic Speech Recognition (ASR) using ONNX models. It targets engineers and researchers seeking a fast, easy-to-use ASR solution with minimal dependencies, deployable across diverse hardware from edge devices to servers. The package simplifies the integration of various state-of-the-art ASR models into applications without requiring heavy deep learning frameworks.

How It Works

The package runs inference through ONNX Runtime, so it does not need heavy deep learning frameworks such as PyTorch or Transformers at run time. It supports ONNX exports of several ASR model families, including NeMo, Kaldi, Vosk, GigaAM, and Whisper, and supplies the preprocessors and decoders each of them needs. This approach enables cross-platform compatibility and efficient execution on both CPUs and GPUs.
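ONNX Runtime selects a hardware backend through an ordered list of execution providers. The sketch below shows one way an application might turn the set of providers available on a machine into such a list. The preference order and the helper name pick_providers are illustrative assumptions, not something taken from onnx-asr; the provider strings are the names ONNX Runtime uses for the backends listed under Highlighted Details.

```python
# Illustrative preference order (an assumption): specialised backends first, CPU last.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available):
    """Return the available providers in preference order.

    `available` is any iterable of provider names. If none of the
    preferred names is present, fall back to the CPU provider.
    """
    available = set(available)
    chosen = [name for name in PREFERRED_PROVIDERS if name in available]
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    # Hypothetical machine with a CUDA-enabled ONNX Runtime build.
    print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

In a real environment the input would typically come from onnxruntime.get_available_providers(), and the result can be passed as the providers argument of onnxruntime.InferenceSession.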

Quick Start & Requirements

Install with pip: pip install onnx-asr[cpu,hub] for CPU, or pip install onnx-asr[gpu,hub] for GPU acceleration. GPU usage requires a compatible CUDA/TensorRT setup, which may also mean running pip install onnxruntime-gpu[cuda,cudnn] tensorrt-cu12-libs. The package supports Python 3.10-3.14 and NumPy 1.22.4-2.4+. A demo is available on Hugging Face Spaces.
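After installation, basic use might look like the sketch below. This summary does not document the Python API, so the calls onnx_asr.load_model and model.recognize, and the model name used as a default, are assumptions based on the project's README and should be checked against it. The wrapper function transcribe is a hypothetical helper; it imports the package lazily so that the module can be loaded even where onnx-asr is not installed.

```python
def transcribe(wav_path, model_name="nemo-parakeet-tdt-0.6b-v2"):
    """Transcribe one WAV file and return the recognized text.

    Assumes the onnx-asr package is installed with the `hub` extra so the
    model can be downloaded on first use. The API calls below are assumed
    from the project's README, not confirmed by this summary.
    """
    import onnx_asr  # lazy import: only needed when the function is called

    model = onnx_asr.load_model(model_name)  # assumed loader entry point
    return model.recognize(wav_path)         # assumed recognition call


if __name__ == "__main__":
    # "test.wav" is a placeholder path for a short PCM WAV file.
    print(transcribe("test.wav"))
```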

Highlighted Details

onnx-asr runs on x86 and Arm CPUs and can accelerate inference with CUDA, TensorRT, CoreML, ROCm, and DirectML. It handles batch processing, transcribes long-form audio by segmenting it with Voice Activity Detection (VAD), and can output token-level timestamps and log probabilities. Quantized models are supported to improve performance. A simple CLI and a Gradio web interface are also provided.
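To illustrate why VAD makes long-form audio tractable, the sketch below splits a signal into voiced segments that each stay under a length cap, so every segment can be recognized on its own. It uses a simple frame-energy threshold as a stand-in for a real VAD model; it is not the VAD that onnx-asr uses, and the function name, threshold, and frame size are hypothetical choices.

```python
import math


def split_on_silence(samples, sample_rate, frame_ms=30,
                     threshold=0.01, max_segment_s=20.0):
    """Return (start, end) sample-index pairs for voiced regions.

    A frame counts as voiced when its RMS energy reaches `threshold`.
    Voiced runs longer than about `max_segment_s` seconds are cut at a
    frame boundary so no segment greatly exceeds the cap.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_len = int(max_segment_s * sample_rate)
    segments = []
    start = None  # start index of the segment currently being built
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        voiced = rms >= threshold
        if voiced and start is None:
            start = pos
        elif start is not None and (not voiced or pos - start >= max_len):
            segments.append((start, pos))
            # Keep going if speech continues; otherwise wait for new speech.
            start = pos if voiced else None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


if __name__ == "__main__":
    # Synthetic signal at 1 kHz: silence, a constant "voiced" stretch, silence.
    signal = [0.0] * 300 + [0.5] * 600 + [0.0] * 300
    print(split_on_silence(signal, sample_rate=1000))
```

Each returned pair can be used to slice the original samples before recognition; the start offsets also let timestamps from individual segments be mapped back to the full recording.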

Maintenance & Community

The provided README does not contain specific details regarding notable contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

The project is released under the permissive MIT License, which allows commercial use and integration into closed-source projects provided the copyright and license notice are retained.

Limitations & Caveats

There is a known issue with onnxruntime version 1.24 involving symlinks in the Hugging Face cache; users may need to install an older version or specify an explicit download path. Most models have a 20-30 second audio limit, so longer inputs need VAD. Supported WAV formats are limited to PCM variants; other audio types require pre-conversion or a library such as soundfile. Some older onnx-community models may have broken fp16 precision.
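Two of these caveats can be checked before a file reaches the recognizer. The sketch below uses Python's standard wave module, which only opens PCM WAV data, as a rough proxy for "is this a PCM WAV file", and compares the duration against a limit. The function name check_wav and the 30 second default are illustrative assumptions; the exact set of PCM variants that onnx-asr accepts may differ from what the wave module reads.

```python
import wave


def check_wav(path, max_seconds=30.0):
    """Return (is_pcm_wav, duration_seconds, needs_vad) for a file.

    If the standard `wave` module cannot open the file (not a WAV file,
    or a non-PCM encoding), the result is (False, None, None).
    """
    try:
        with wave.open(path, "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError):
        return False, None, None
    return True, duration, duration > max_seconds


if __name__ == "__main__":
    import os
    import tempfile

    # Write one second of 16-bit mono silence at 8 kHz, then inspect it.
    path = os.path.join(tempfile.mkdtemp(), "silence.wav")
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 8000)
    print(check_wav(path))
```

A file that fails the first check would need conversion to PCM WAV, and one that exceeds the limit would need to be segmented before recognition.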

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 8
Issues (30d): 6
Star History: 17 stars in the last 30 days
