qwen3-asr by Quantatirsk

Local Speech Recognition API Service

Created 4 months ago
264 stars

Top 96.6% on SourcePulse

Project Summary

Quantatirsk/funasr-api provides a ready-to-use, local speech recognition API service powered by FunASR and Qwen3-ASR. It supports 52 languages and is compatible with both the OpenAI API and the Alibaba Cloud Speech API. It lets engineers and researchers deploy multi-language ASR locally, with speaker diarization and real-time streaming.

How It Works

The service integrates several ASR models, including Qwen3-ASR (1.7B/0.6B) and Paraformer Large, and uses vLLM for Qwen3-ASR inference. It exposes OpenAI-compatible and Alibaba Cloud-compatible API interfaces so that existing clients can integrate with it. Key features include CAM++-based speaker diarization, audio segmentation via VAD and greedy merging, and GPU batch processing for higher throughput.
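The "VAD and greedy merging" step can be illustrated with a minimal sketch. It assumes VAD has already produced a list of (start, end) speech regions in seconds, and that adjacent regions are merged left to right as long as the merged chunk stays under a length limit. The function name `greedy_merge`, the 30-second limit, and the sample segments are hypothetical and are not taken from the project's code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one VAD speech region


def greedy_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Merge adjacent VAD segments left to right while the merged span stays <= max_len."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and end - merged[-1][0] <= max_len:
            # Extending the current chunk keeps it within the limit: absorb the segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Otherwise start a new chunk (a single over-long segment is kept as-is).
            merged.append((start, end))
    return merged


# Hypothetical VAD output for a 70-second recording.
vad_segments = [(0.0, 4.0), (4.5, 12.0), (13.0, 29.0), (31.0, 40.0), (41.0, 70.0)]
chunks = greedy_merge(vad_segments, max_len=30.0)
print(chunks)
```

Fewer, longer chunks mean fewer model calls, while the length cap keeps each chunk within what a model can handle in one pass. The actual limit and merge rule used by the service may differ.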

Quick Start & Requirements

  • Primary Install: Docker Deployment (Recommended).

  • Prerequisites: Python 3.10+, CUDA 12.6+ (for GPU acceleration), FFmpeg.

  • Resource Footprint: Minimum (CPU): 4 cores, 16GB RAM, 20GB disk. Recommended (GPU): 4 cores, 16GB RAM, NVIDIA GPU (16GB+ VRAM), 20GB disk.

  • Commands and endpoints:

    • Docker GPU: docker-compose up -d
    • Docker CPU: docker-compose -f docker-compose-cpu.yml up -d
    • API Endpoint: http://localhost:17003
    • API Docs: http://localhost:17003/docs
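Once the container is running, a client could call the OpenAI-compatible route. The sketch below only builds the multipart request with the Python standard library and does not send it. The form field names `file` and `model` follow the OpenAI audio transcription convention. The model id `qwen3-asr-1.7b`, the file name, and the absence of an API key are assumptions, so check http://localhost:17003/docs for the values the service actually accepts.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:17003"  # API endpoint listed in the quick start


def build_transcription_request(audio: bytes, filename: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain text field carrying the model id.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        # File field carrying the raw audio bytes.
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode() + audio + b"\r\n",
        # Closing boundary.
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes and a placeholder model id, for illustration only.
req = build_transcription_request(b"RIFF....", "sample.wav", model="qwen3-asr-1.7b")
print(req.get_method(), req.full_url)
# To send it against a running service: urllib.request.urlopen(req).read()
```

Because the route follows the OpenAI convention, an existing OpenAI client library pointed at this base URL should also work in principle, though that is not verified here.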

Highlighted Details

  • Multi-Model Support: Integrates Qwen3-ASR (1.7B/0.6B) and Paraformer Large models.
  • API Compatibility: Supports OpenAI API (/v1/audio/transcriptions) and Alibaba Cloud Speech API (RESTful, WebSocket).
  • Speaker Diarization: Automatic multi-speaker identification using CAM++ model, enabled by default.
  • Real-time Streaming: Supports WebSocket streaming with low latency.
  • Performance: GPU batch processing offers 2-3x speedup over sequential processing.
  • Intelligent Audio Processing: Smart far-field filtering and VAD-based audio segmentation.
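To make the batch-processing item concrete, here is a small sketch of the general idea: audio segments are grouped so that the model is called once per batch rather than once per segment. The names `transcribe_in_batches` and `fake_model` and the batch size of 4 are hypothetical. The sketch shows only the grouping logic, not the project's implementation, and it does not reproduce the quoted 2-3x speedup.

```python
from typing import Callable, List, Sequence


def transcribe_in_batches(
    segments: Sequence[str],
    run_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Call run_batch once per group of up to batch_size segments, preserving order."""
    results: List[str] = []
    for i in range(0, len(segments), batch_size):
        results.extend(run_batch(list(segments[i:i + batch_size])))
    return results


calls: List[int] = []


def fake_model(batch: List[str]) -> List[str]:
    # Stand-in for a real ASR forward pass; records the batch size it was given.
    calls.append(len(batch))
    return [f"text:{name}" for name in batch]


out = transcribe_in_batches([f"seg{i}" for i in range(10)], fake_model, batch_size=4)
print(calls)  # batch sizes per model call
```

With ten segments and a batch size of 4, the stand-in model is invoked three times instead of ten. On a GPU, fewer and larger calls are generally what makes batching faster than sequential processing.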

Maintenance & Community

  • No specific details on contributors, sponsorships, or community channels (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

FunASR streaming does not support word-level timestamps or confidence scores. Qwen3 models require a GPU and the vLLM backend; in CPU environments they are filtered out automatically. Word-level timestamps are available only with Qwen3-ASR streaming.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 8

Star History

  • 44 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

awesome-diarization by wq2012

0.1%
2k
List of resources for speaker diarization
Created 7 years ago
Updated 9 months ago