onnx-asr  by istupakov

Lightweight ONNX-based Automatic Speech Recognition (ASR)

Created 10 months ago
260 stars

Top 97.6% on SourcePulse

Summary

onnx-asr is a lightweight Python package designed for efficient Automatic Speech Recognition (ASR) using ONNX models. It targets engineers and researchers seeking a fast, easy-to-use ASR solution with minimal dependencies, deployable across diverse hardware from edge devices to servers. The package simplifies the integration of various state-of-the-art ASR models into applications without requiring heavy deep learning frameworks.

How It Works

The package runs inference through ONNX Runtime, so heavy deep learning frameworks such as PyTorch or Transformers are not needed at runtime. It supports a wide range of ONNX-exported ASR models, including NeMo, Kaldi, Vosk, GigaAM, and Whisper, and supplies the preprocessors and decoders each one needs. This keeps the package cross-platform and lets it run efficiently on both CPUs and GPUs.

Quick Start & Requirements

Install with pip: pip install onnx-asr[cpu,hub] for CPU, or pip install onnx-asr[gpu,hub] for GPU acceleration. GPU use requires a compatible CUDA/TensorRT setup and may also need pip install onnxruntime-gpu[cuda,cudnn] tensorrt-cu12-libs. The package supports Python 3.10 through 3.14 and NumPy versions from 1.22.4 through 2.4+. A demo is available on Hugging Face Spaces.
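After installation, a first transcription might look like the sketch below. The summary above does not show the Python API, so the calls used here (onnx_asr.load_model and model.recognize) and the model name are assumptions recalled from the project's README; check them against the current documentation before relying on them. The import is done inside the function so the file still loads on machines where onnx-asr is not installed.

```python
def transcribe(wav_path, model_name="nemo-parakeet-tdt-0.6b-v2"):
    """Transcribe one WAV file and return the recognized text.

    Assumptions (not confirmed by the summary above):
    - onnx_asr.load_model(name) returns a model object
    - that object has a recognize(path) method returning text
    - "nemo-parakeet-tdt-0.6b-v2" is a valid model name
    """
    # Lazy import: the module is only required when the function is called.
    import onnx_asr

    model = onnx_asr.load_model(model_name)
    return model.recognize(wav_path)


if __name__ == "__main__":
    import sys

    # Usage: python transcribe.py recording.wav
    if len(sys.argv) > 1:
        print(transcribe(sys.argv[1]))
```

The first call may download model files, which is why the hub extra appears in the install commands above.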

Highlighted Details

onnx-asr runs on x86 and Arm CPUs and can be accelerated with CUDA, TensorRT, CoreML, ROCm, and DirectML. It handles batch processing and long-form audio (via Voice Activity Detection, VAD), and it can output token-level timestamps and log probabilities. Quantized models are supported for faster inference. A simple CLI and a Gradio web interface are also provided.
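The accelerators listed above correspond to ONNX Runtime execution providers. The sketch below shows one way to choose among them: it filters a preference list against whatever providers are actually installed. The provider names are the standard ONNX Runtime identifiers, but the preference order and the pick_providers helper are illustrative assumptions, not part of onnx-asr.

```python
# Standard ONNX Runtime execution provider names, in a hypothetical
# order of preference (fastest accelerator first, CPU as the fallback).
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preference=PREFERENCE):
    """Return the preferred providers that are available, in preference order."""
    available = set(available)
    return [name for name in preference if name in available]


if __name__ == "__main__":
    # With onnxruntime installed, the real list would come from
    # onnxruntime.get_available_providers(); a fixed list is used here
    # so the example runs without it.
    print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

ONNX Runtime accepts an ordered provider list and falls back to later entries when an earlier one cannot run a model, which is why CPU is kept at the end.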

Maintenance & Community

The provided README does not contain specific details regarding notable contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

The project is released under the permissive MIT License, which allows commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

onnxruntime 1.24 has a known issue with symlinks in the Hugging Face cache; the workaround is to use an older onnxruntime version or to specify a download path explicitly. Most models accept at most 20-30 seconds of audio per input, so longer recordings need VAD. Only PCM variants of WAV are read directly; other audio formats must be converted first or loaded with a library such as soundfile. Some older onnx-community models may have broken fp16 precision.
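The format and length limits above can be checked before a file is sent to a model. The sketch below uses only Python's standard wave module, which parses PCM WAV data, as a stand-in for the package's own checks. The 30-second threshold is an assumption taken from the upper end of the range quoted above, and the helper names and return labels are hypothetical.

```python
import wave

MAX_SECONDS = 30.0  # assumption: upper end of the 20-30 second limit


def inspect_wav(path):
    """Return (is_pcm_wav, duration_seconds) for the file at path.

    A file that the stdlib wave module cannot parse is reported as
    not being a PCM WAV, with a duration of 0.0.
    """
    try:
        with wave.open(path, "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError):
        return False, 0.0
    return True, duration


def plan_input(path, max_seconds=MAX_SECONDS):
    """Suggest how to handle a file: 'convert', 'use_vad', or 'direct'."""
    is_pcm, duration = inspect_wav(path)
    if not is_pcm:
        return "convert"   # pre-convert, or load with a library like soundfile
    if duration > max_seconds:
        return "use_vad"   # too long for a single pass
    return "direct"
```

Calling plan_input on a short PCM recording returns "direct"; a file longer than the threshold returns "use_vad", and anything the wave module rejects returns "convert".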

Health Check

Last Commit: 2 days ago
Responsiveness: Inactive
Pull Requests (30d): 15
Issues (30d): 8
Star History: 33 stars in the last 30 days
