Qwen3-ASR-Toolkit by QwenLM

High-throughput ASR toolkit for long audio and video

Created 2 months ago

693 stars

Top 49.2% on SourcePulse


1 Expert Loves This Project

Junyang Lin

Core Maintainer at Alibaba Qwen

Project Summary

This toolkit works around the Qwen-ASR API's 3-minute audio duration limit, enabling high-throughput transcription of long audio and video files. It targets users who need to process large volumes of media efficiently, combining intelligent splitting with parallel processing to reduce turnaround time.

How It Works

The toolkit runs a pipeline that first loads a media file or URL and automatically resamples the audio to the required 16 kHz mono format. It then uses Voice Activity Detection (VAD) to split the audio at natural silent pauses, so that each chunk stays under the 3-minute API limit and no cut falls mid-sentence or mid-word. The chunks are sent concurrently, via multi-threading, to the DashScope Qwen-ASR API. Finally, the transcribed segments are re-ordered, merged, and post-processed to remove common ASR hallucinations and repetitions.
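The splitting step can be illustrated with a minimal Python sketch. It is not the toolkit's actual code: the function name `plan_chunks`, the greedy strategy, and the assumption that a VAD has already produced a list of silent cut points (in seconds) are all hypothetical, chosen only to show how chunks can be kept under the 3-minute limit while cutting at pauses.

```python
from typing import List, Tuple

MAX_CHUNK_SECONDS = 180.0  # the 3-minute API limit described above


def plan_chunks(duration: float, silences: List[float],
                max_len: float = MAX_CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Greedily cut at the latest silence that keeps each chunk <= max_len.

    `silences` holds candidate cut points in seconds (assumed to come from
    a VAD). If no silence falls inside the current window, fall back to a
    hard cut at exactly max_len.
    """
    cuts = sorted(t for t in silences if 0.0 < t < duration)
    chunks = []
    start = 0.0
    while duration - start > max_len:
        limit = start + max_len
        candidates = [t for t in cuts if start < t <= limit]
        end = candidates[-1] if candidates else limit
        chunks.append((start, end))
        start = end
    chunks.append((start, duration))  # remaining tail, already short enough
    return chunks


if __name__ == "__main__":
    # A 400-second recording with pauses at these timestamps.
    print(plan_chunks(400.0, [50.0, 170.0, 200.0, 340.0, 390.0]))
```

Choosing the latest pause inside each window keeps chunks as long as possible, which reduces the number of API calls; a real implementation might also enforce a minimum chunk length.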

Quick Start & Requirements

Primary install: pip install qwen3-asr-toolkit
Prerequisites: Python 3.8+, FFmpeg (installed and available on PATH), and a DashScope API key (preferably supplied via the DASHSCOPE_API_KEY environment variable).
Usage: qwen3-asr -i <input_file_or_url> [options]
Links: Installation and usage examples are provided within the README.
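Before running the tool, the prerequisites listed above can be checked with a short script. This is a hypothetical pre-flight helper, not part of the toolkit: the function names `missing_prerequisites` and `resample_command` are invented for illustration. The FFmpeg flags shown (`-ac 1` for mono, `-ar 16000` for 16 kHz) are standard FFmpeg options, but the toolkit performs its own resampling, so the command is only a sketch of the equivalent manual conversion.

```python
import os
import shutil
from typing import List, Mapping, Optional


def missing_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if shutil.which("ffmpeg") is None:  # FFmpeg must be discoverable on PATH
        problems.append("ffmpeg not found on PATH")
    if not env.get("DASHSCOPE_API_KEY"):  # unset or empty counts as missing
        problems.append("DASHSCOPE_API_KEY is not set")
    return problems


def resample_command(src: str, dst: str) -> List[str]:
    """Build an FFmpeg command converting `src` to 16 kHz mono audio."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing:", problem)
    print(" ".join(resample_command("talk.mp4", "talk.wav")))
```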

Highlighted Details

Bypasses the official Qwen-ASR API's 3-minute audio length limitation.
Features smart audio splitting using VAD for natural chunking.
Achieves high-speed transcription through parallel API calls.
Includes intelligent post-processing to clean transcripts and remove artifacts.
Supports universal media formats via FFmpeg and handles automatic audio resampling.
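The parallel-call and clean-up ideas in the list above can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's implementation: `transcribe` stands in for whatever function calls the DashScope Qwen-ASR API (no real API call is made here), and `collapse_repeats` is a deliberately simple, hypothetical filter for word-level repetitions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def transcribe_all(chunks: Sequence[str], transcribe: Callable[[str], str],
                   workers: int = 4) -> List[str]:
    """Transcribe chunks in parallel threads, returning results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `chunks`, even when
        # individual calls finish out of order, so no manual re-sort is needed.
        return list(pool.map(transcribe, chunks))


def collapse_repeats(text: str, max_repeat: int = 2) -> str:
    """Keep at most `max_repeat` consecutive copies of the same word."""
    out: List[str] = []
    run = 0
    for word in text.split():
        run = run + 1 if out and word == out[-1] else 1
        if run <= max_repeat:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    # Stand-in for the real API call (hypothetical).
    def fake_transcribe(chunk: str) -> str:
        return f"text of {chunk}"

    parts = transcribe_all(["chunk0.wav", "chunk1.wav"], fake_transcribe)
    print(collapse_repeats(" ".join(parts)))
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.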

Maintenance & Community

Contributions are welcomed via pull requests and issues. The README does not specify details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The project is licensed under the MIT License, which permits broad usage, including commercial applications and linking within closed-source projects.

Limitations & Caveats

The toolkit requires FFmpeg to be installed separately. Transcription accuracy and speed depend on the quality of the input audio and on how well the VAD splitting and post-processing perform on it. Use of the DashScope API may incur costs.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive

Star History

32 stars in the last 30 days