Qwen3-ASR by QwenLM

Advanced multilingual speech recognition and alignment

Created 4 weeks ago

1,645 stars

Top 25.2% on SourcePulse

Project Summary

Qwen3-ASR is an open-source ASR model family from Alibaba Cloud, offering robust multilingual speech, music, and song recognition. It features two all-in-one models (1.7B and 0.6B) supporting 52 languages/dialects and a novel non-autoregressive forced aligner for precise timestamp prediction. This suite delivers state-of-the-art open-source performance, competitive with commercial APIs, and advanced audio understanding capabilities.

How It Works

Built on large-scale speech data and the Qwen3-Omni foundation model, Qwen3-ASR offers two primary ASR models: the high-accuracy 1.7B version and the lightweight 0.6B version optimized for throughput. A key innovation is Qwen3-ForcedAligner-0.6B, a non-autoregressive model that provides superior timestamp accuracy for text-speech alignment across 11 languages. The architecture supports unified streaming and offline inference.
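As a conceptual illustration of what a forced aligner produces (this is not the qwen-asr API, and the 40 ms frame shift is an assumed value), alignment maps each text unit to a time span in the audio:

```python
# Conceptual sketch, NOT the project's API: convert per-token start/end
# frame indices (as an aligner might emit) into second-level timestamps.
# frame_shift_ms=40.0 is an illustrative assumption, not a documented value.

def frames_to_timestamps(token_frames, frame_shift_ms=40.0):
    """Map (token, start_frame, end_frame) triples to (token, start_s, end_s)."""
    scale = frame_shift_ms / 1000.0
    return [(tok, round(s * scale, 3), round(e * scale, 3))
            for tok, s, e in token_frames]

# Hypothetical aligner output for the utterance "hello world"
alignment = frames_to_timestamps([("hello", 0, 12), ("world", 13, 30)])
print(alignment)  # [('hello', 0.0, 0.48), ('world', 0.52, 1.2)]
```

The non-autoregressive design means all such spans are predicted in parallel rather than token by token, which is what makes the aligner fast at inference time.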

Quick Start & Requirements

Installation is via pip: pip install -U qwen-asr or pip install -U qwen-asr[vllm] for the vLLM backend. Python 3.12 is recommended. GPU acceleration is crucial; FlashAttention 2 (pip install -U flash-attn --no-build-isolation) is recommended for performance and memory efficiency, requiring compatible hardware and float16/bfloat16 dtypes. Official demos and examples are available on Hugging Face and ModelScope.
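Collected from the prose above as a shell snippet (package name and extras are as stated in this summary; verify against the project README before use):

```shell
# Base install of the qwen-asr toolkit
pip install -U qwen-asr

# Or, with the optional vLLM backend for batched/streaming inference
pip install -U "qwen-asr[vllm]"

# Recommended for speed and memory efficiency on supported GPUs
# (requires compatible hardware and float16/bfloat16 dtypes)
pip install -U flash-attn --no-build-isolation
```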

Highlighted Details

  • 52-Language Multilingual Support: Covers speech, singing, and songs with BGM across 30 languages and 22 Chinese dialects.
  • SOTA Performance: The 1.7B model rivals commercial APIs and leads open-source benchmarks; the 0.6B model offers high throughput.
  • Accurate Forced Alignment: Qwen3-ForcedAligner-0.6B provides precise timestamp prediction for arbitrary speech units in 11 languages, surpassing E2E models.
  • Versatile Inference: Supports vLLM batching, streaming, asynchronous serving, and timestamp output via a comprehensive toolkit.

Maintenance & Community

Developed by Alibaba Cloud's Qwen team, with recent updates in January 2026. Community support is available via WeChat and Discord. Links to official blogs and demos are provided.

Licensing & Compatibility

The README does not specify an open-source license, so check with the maintainers before commercial use or integration.

Limitations & Caveats

FlashAttention 2 has hardware and dtype prerequisites. The vLLM backend requires careful setup. Timestamp prediction relies on the separate Qwen3-ForcedAligner-0.6B model.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 38
  • Star History: 1,665 stars in the last 28 days

Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 43 more.

Explore Similar Projects

  • whisper by openai — 95k stars. Speech recognition model for multilingual transcription/translation. Created 3 years ago; updated 2 months ago.