Qwen3-ASR by QwenLM

Advanced multilingual speech recognition and alignment

Created 4 weeks ago

1,645 stars

Top 25.2% on SourcePulse

Project Summary

Qwen3-ASR is an open-source ASR model family from Alibaba Cloud, offering robust multilingual speech, music, and song recognition. It features two all-in-one models (1.7B and 0.6B) supporting 52 languages/dialects and a novel non-autoregressive forced aligner for precise timestamp prediction. This suite delivers state-of-the-art open-source performance, competitive with commercial APIs, and advanced audio understanding capabilities.

How It Works

Built on large-scale speech data and the Qwen3-Omni foundation model, Qwen3-ASR offers two primary ASR models: the high-accuracy 1.7B version and the lightweight 0.6B version optimized for throughput. A key innovation is Qwen3-ForcedAligner-0.6B, a non-autoregressive model that provides superior timestamp accuracy for text-speech alignment across 11 languages. The architecture supports unified streaming and offline inference.
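As a conceptual illustration of what a forced aligner produces (this is not the qwen-asr API, and the 40 ms frame shift is an assumed value), alignment maps each text unit to a time span in the audio:

```python
# Conceptual sketch, NOT the project's API: convert per-token start/end
# frame indices (as an aligner might emit) into second-level timestamps.
# frame_shift_ms=40.0 is an illustrative assumption, not a documented value.

def frames_to_timestamps(token_frames, frame_shift_ms=40.0):
    """Map (token, start_frame, end_frame) triples to (token, start_s, end_s)."""
    scale = frame_shift_ms / 1000.0
    return [(tok, round(s * scale, 3), round(e * scale, 3))
            for tok, s, e in token_frames]

# Hypothetical aligner output for the utterance "hello world"
alignment = frames_to_timestamps([("hello", 0, 12), ("world", 13, 30)])
print(alignment)  # [('hello', 0.0, 0.48), ('world', 0.52, 1.2)]
```

The non-autoregressive design means all such spans are predicted in parallel rather than token by token, which is what makes the aligner fast at inference time.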

Quick Start & Requirements

Installation is via pip: pip install -U qwen-asr or pip install -U qwen-asr[vllm] for the vLLM backend. Python 3.12 is recommended. GPU acceleration is crucial; FlashAttention 2 (pip install -U flash-attn --no-build-isolation) is recommended for performance and memory efficiency, requiring compatible hardware and float16/bfloat16 dtypes. Official demos and examples are available on Hugging Face and ModelScope.
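Collected from the prose above as a shell snippet (package name and extras are as stated in this summary; verify against the project README before use):

```shell
# Base install of the qwen-asr toolkit
pip install -U qwen-asr

# Or, with the optional vLLM backend for batched/streaming inference
pip install -U "qwen-asr[vllm]"

# Recommended for speed and memory efficiency on supported GPUs
# (requires compatible hardware and float16/bfloat16 dtypes)
pip install -U flash-attn --no-build-isolation
```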

Highlighted Details

  • 52-Language Multilingual Support: Covers speech, singing, and songs with BGM across 30 languages and 22 Chinese dialects.
  • SOTA Performance: The 1.7B model rivals commercial APIs and leads open-source benchmarks; the 0.6B model offers high throughput.
  • Accurate Forced Alignment: Qwen3-ForcedAligner-0.6B provides precise timestamp prediction for arbitrary speech units in 11 languages, surpassing E2E models.
  • Versatile Inference: Supports vLLM batching, streaming, asynchronous serving, and timestamp output via a comprehensive toolkit.

Maintenance & Community

Developed by Alibaba Cloud's Qwen team, with recent updates in January 2026. Community support is available via WeChat and Discord. Links to official blogs and demos are provided.

Licensing & Compatibility

The README does not specify an open-source license, so check with the maintainers before commercial use or integration.

Limitations & Caveats

FlashAttention 2 has hardware and dtype prerequisites. The vLLM backend requires careful setup. Timestamp prediction relies on the separate Qwen3-ForcedAligner-0.6B model.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 38
  • Star History: 1,665 stars in the last 28 days

Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 43 more.

Explore Similar Projects

  • whisper by openai — 95k stars. Speech recognition model for multilingual transcription/translation. Created 3 years ago; updated 2 months ago.