qwen3-asr by Quantatirsk

Local Speech Recognition API Service

Created 4 months ago

264 stars

Top 96.6% on SourcePulse

Project Summary

Quantatirsk/funasr-api provides a ready-to-use, local speech recognition API service powered by FunASR and Qwen3-ASR. It supports 52 languages and is compatible with both the OpenAI API and the Alibaba Cloud Speech API. Engineers and researchers can use it to deploy multi-language ASR locally, with features such as speaker diarization and real-time streaming.

How It Works

The service integrates several ASR models, including Qwen3-ASR (1.7B/0.6B) and Paraformer Large, and uses vLLM for Qwen3-ASR inference. It exposes familiar API interfaces (OpenAI, Alibaba Cloud) so that existing clients can integrate with little change. Key features include CAM++-based speaker diarization, audio segmentation via VAD and greedy merging, and GPU batch processing for higher throughput.
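The "VAD and greedy merging" step mentioned above can be illustrated with a minimal sketch. This is not the project's code: the function name `greedy_merge`, the `(start, end)` segment representation, and the 30-second `max_len` limit are all assumptions chosen for illustration. The idea is that consecutive speech segments found by VAD are packed into one chunk for as long as the chunk stays within a maximum duration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds), hypothetical format


def greedy_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Merge consecutive VAD segments while the merged span stays within max_len.

    max_len is an assumed limit, not a value taken from the project.
    """
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and end - merged[-1][0] <= max_len:
            # The current chunk can absorb this segment without exceeding max_len.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Start a new chunk; a single over-long segment is kept unchanged.
            merged.append((start, end))
    return merged


vad_segments = [(0.0, 4.0), (4.5, 12.0), (13.0, 29.0), (31.0, 40.0), (41.0, 62.0)]
print(greedy_merge(vad_segments, max_len=30.0))
# -> [(0.0, 29.0), (31.0, 40.0), (41.0, 62.0)]
```

Fewer, longer chunks mean fewer model invocations, while the cap keeps each chunk within what the recognizer handles well. The actual limit and merge rule used by the service may differ.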

Quick Start & Requirements

Primary Install: Docker deployment (recommended).
- GPU: docker-compose up -d
- CPU: docker-compose -f docker-compose-cpu.yml up -d
Prerequisites: Python 3.10+, CUDA 12.6+ (for GPU acceleration), FFmpeg.
Resource Footprint: Minimum (CPU): 4 cores, 16GB RAM, 20GB disk. Recommended (GPU): 4 cores, 16GB RAM, NVIDIA GPU (16GB+ VRAM), 20GB disk.
Links:
- API Endpoint: http://localhost:17003
- API Docs: http://localhost:17003/docs
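After starting the containers, the service may take some time to become reachable. The following is a small sketch of a readiness check that polls the API Docs URL listed above. The helper names `http_ok` and `wait_until_ready`, the retry count, and the delay are assumptions for illustration; the project may expose a dedicated health endpoint that this summary does not mention.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(url: str,
                     probe: Callable[[str], bool] = http_ok,
                     attempts: int = 30,
                     delay: float = 2.0) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Example (requires the service to be running):
# wait_until_ready("http://localhost:17003/docs")
```

The `probe` parameter is injectable so that the polling logic can be exercised without a running server.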

Highlighted Details

Multi-Model Support: Integrates Qwen3-ASR (1.7B/0.6B) and Paraformer Large models.
API Compatibility: Supports OpenAI API (/v1/audio/transcriptions) and Alibaba Cloud Speech API (RESTful, WebSocket).
Speaker Diarization: Automatic multi-speaker identification using the CAM++ model, enabled by default.
Real-time Streaming: Supports WebSocket streaming with low latency.
Performance: GPU batch processing offers 2-3x speedup over sequential processing.
Intelligent Audio Processing: Smart far-field filtering and VAD-based audio segmentation.
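As a sketch of what the OpenAI API compatibility listed above means for a client, the code below builds a multipart POST request to /v1/audio/transcriptions on the local endpoint using only the Python standard library. The `file` and `model` form fields follow the OpenAI transcription API convention. The helper names and the model identifier in the usage comment are hypothetical, and whether the service requires an API key is not stated in this summary.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:17003"  # API endpoint from the Quick Start section


def build_transcription_request(audio: bytes, filename: str, model: str,
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style multipart transcription request."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + b"\r\n" + closing
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str, model: str) -> dict:
    """Send an audio file to the local service and return the parsed JSON reply."""
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), os.path.basename(path), model)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the service to be running; the model name is a placeholder,
# check http://localhost:17003/docs for the identifiers the service accepts):
# print(transcribe("meeting.wav", model="qwen3-asr-1.7b"))
```

Because the route follows the OpenAI convention, an existing OpenAI client library pointed at the local base URL should work as an alternative to hand-built requests, assuming the service accepts the client's default parameters.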

Maintenance & Community

No specific details on contributors, sponsorships, or community channels (Discord/Slack) are provided in the README.

Licensing & Compatibility

License: MIT License.
Compatibility: Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

FunASR streaming does not support word-level timestamps or confidence scores. Qwen3 models require a GPU and the vLLM backend; CPU-only environments automatically exclude them. Word-level timestamps are available only with Qwen3-ASR streaming.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

44 stars in the last 30 days