QwenASRMiniTool by dseditor

Local ASR tool for real-time transcription and subtitle generation

Created 4 months ago

335 stars

Top 81.8% on SourcePulse

Project Summary

This project provides a lightweight, local speech-to-text tool for generating subtitles from audio/video files and real-time microphone input. It targets both casual users needing a simple solution and power users seeking high accuracy via GPU acceleration, offering a free and accessible ASR experience.

How It Works

The tool runs Qwen3-ASR models in one of two modes: a CPU-only mode that uses OpenVINO with INT8 quantization, and a GPU mode that uses Vulkan via chatllm.cpp to run 1.7B GGUF models for higher accuracy. It also provides automatic voice activity detection (VAD), speaker diarization, support for 30+ languages, recognition hints, and an integrated subtitle editor.
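To make the VAD step concrete, here is a minimal sketch of energy-based voice activity detection in Python. The page does not say which VAD method the tool actually uses, so this is a generic illustration, not the project's implementation. The function name, the threshold, and the frame sizes are all hypothetical choices.

```python
from typing import List, Optional, Tuple


def detect_speech_segments(
    samples: List[float],
    sample_rate: int = 16000,        # assumed sample rate, not from the source
    frame_ms: int = 10,
    energy_threshold: float = 0.01,  # hypothetical threshold on mean squared amplitude
    min_silence_frames: int = 10,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans where frame energy exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments: List[Tuple[float, float]] = []
    start: Optional[int] = None  # frame index where the current speech run began
    silence = 0                  # consecutive quiet frames inside the current run

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= energy_threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment just after the last loud frame.
                end = i - silence + 1
                segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start, silence = None, 0

    if start is not None:
        # Audio ended while a speech run was still open.
        end = n_frames - silence
        segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return segments
```

Each returned span could then be passed to the recognizer on its own, which keeps long silences out of the model input.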

Quick Start & Requirements

Portable EXE builds work out of the box and download models automatically (~1.2 GB for CPU, ~2.3 GB for GPU). Installing from source requires Python 3.10+ and a git clone of the repository. Video processing relies on ffmpeg, which the tool can download automatically. Minimum requirements are Windows 10/11 (64-bit) and 6 GB RAM for CPU mode; GPU mode requires a Vulkan 1.2+ compatible GPU with 8 GB RAM.
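For a source install, the prerequisites above can be checked with a few lines of standard-library Python. The sketch below also builds an ffmpeg command that extracts a mono WAV track from a video. The helper names are hypothetical, and the 16 kHz sample rate is an assumption rather than a documented setting of the tool. The ffmpeg flags themselves (`-y`, `-i`, `-vn`, `-ac`, `-ar`) are standard.

```python
import shutil
import sys
from typing import Dict, List


def check_environment() -> Dict[str, bool]:
    """Report whether the Python version and ffmpeg prerequisites are met."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


def build_audio_extract_cmd(
    video_path: str, wav_path: str, sample_rate: int = 16000
) -> List[str]:
    """Build (but do not run) an ffmpeg command that extracts mono audio."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample (16 kHz is an assumed default)
        wav_path,
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the extraction on a machine where ffmpeg is installed.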

Highlighted Details

Hybrid CPU/GPU: Seamlessly switch between efficient CPU inference (OpenVINO INT8) and high-accuracy GPU acceleration (Vulkan).
Broad GPU Support: Vulkan backend via chatllm.cpp supports NVIDIA, AMD, and Intel GPUs without CUDA/ROCm dependencies.
End-to-End Workflow: Handles audio/video input, real-time transcription, speaker labeling, multi-language recognition, and subtitle editing.
User-Friendly Setup: Portable builds and optional ffmpeg installer simplify deployment.
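As a rough illustration of the hybrid CPU/GPU choice, the sketch below picks a mode from the minimum requirements listed under Quick Start & Requirements. It is not the tool's own selection logic: the `Machine` type, the backend labels, and the reading of the 8 GB figure as GPU memory are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    ram_gb: float
    has_vulkan_1_2_gpu: bool
    gpu_memory_gb: float = 0.0  # assumption: the 8 GB figure refers to GPU memory


def pick_backend(m: Machine) -> str:
    """Prefer the Vulkan GPU mode when its minimums are met, else fall back to CPU."""
    if m.has_vulkan_1_2_gpu and m.gpu_memory_gb >= 8:
        return "gpu-vulkan"         # hypothetical label for the chatllm.cpp path
    if m.ram_gb >= 6:
        return "cpu-openvino-int8"  # hypothetical label for the OpenVINO path
    return "unsupported"


print(pick_backend(Machine(ram_gb=16, has_vulkan_1_2_gpu=True, gpu_memory_gb=8)))
```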

Maintenance & Community

Recent updates indicate active development with ongoing feature additions. No specific community channels or contributor details are provided.

Licensing & Compatibility

The core project code is MIT licensed. It integrates ffmpeg (GPL) for video processing and uses pre-compiled binaries from chatllm.cpp (MIT) for its Vulkan GPU backend. Model weights are subject to their respective source licenses. The tool primarily targets Windows 10/11 (64-bit).

Limitations & Caveats

The CPU-only mode offers only "ordinary" recognition rates. Real-time transcription is not true streaming: audio is processed during pauses in speech. Speaker diarization is less effective when speakers talk simultaneously. The application path must not contain Chinese characters. CUDA support is in maintenance mode; Vulkan is the primary GPU focus.
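The path restriction is easy to check before unpacking the tool. The following is a minimal sketch, assuming "Chinese characters" means code points in the CJK Unified Ideographs block and its Extension A. The function names are hypothetical, and the tool may reject other non-ASCII characters as well.

```python
def has_chinese_chars(text: str) -> bool:
    """True if text contains a CJK Unified Ideograph (main block or Extension A)."""
    return any(
        "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
        for ch in text
    )


def is_safe_install_path(path: str) -> bool:
    """True if the path avoids the characters the caveat above warns about."""
    return not has_chinese_chars(path)
```

For example, `is_safe_install_path("C:/Tools/QwenASR")` passes, while a path with a folder named `工具` does not.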

Health Check

Last Commit

5 days ago

Responsiveness

Inactive

Star History

34 stars in the last 30 days