High-throughput ASR toolkit for long audio and video
This toolkit works around the Qwen-ASR API's 3-minute audio duration limit, enabling high-throughput transcription of long audio and video files. It is designed for users who need to process extensive media content efficiently, combining robust transcription with intelligent splitting and parallel processing to significantly reduce turnaround time.
How It Works
The toolkit employs a pipeline that first loads media files or URLs. It then uses Voice Activity Detection (VAD) to split the audio into manageable chunks at natural silent pauses, keeping each segment under the 3-minute API limit while avoiding cuts mid-sentence. These chunks are transcribed concurrently via multi-threading against the DashScope Qwen-ASR API. Finally, the transcribed segments are reassembled in order and cleaned by post-processing that removes common ASR hallucinations and repetitions, with the audio automatically resampled to the required 16 kHz mono format.
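The split-then-parallelize flow described above can be sketched as follows. This is a hedged illustration, not the toolkit's actual code: `split_at_silences` and `transcribe_chunk` are hypothetical stand-ins, and the stub below replaces both the real VAD model and the DashScope Qwen-ASR API call.

```python
# Sketch: silence-based splitting, parallel transcription, ordered reassembly.
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK_S = 180  # stay under the API's 3-minute limit

def split_at_silences(silence_points, total_s, max_len=MAX_CHUNK_S):
    """Cut [0, total_s] into chunks <= max_len, preferring silent pauses."""
    chunks, start = [], 0.0
    candidates = sorted(p for p in silence_points if p > 0)
    while total_s - start > max_len:
        # Pick the last silent pause that keeps the chunk within the limit;
        # fall back to a hard cut if no pause fits.
        fits = [p for p in candidates if start < p <= start + max_len]
        cut = fits[-1] if fits else start + max_len
        chunks.append((start, cut))
        start = cut
    chunks.append((start, total_s))
    return chunks

def transcribe_chunk(idx_chunk):
    idx, (start, end) = idx_chunk
    # Placeholder for the DashScope Qwen-ASR call on audio[start:end].
    return idx, f"[{start:.0f}s-{end:.0f}s]"

# Hypothetical 500 s recording with silent pauses at 170 s, 295 s, 410 s.
chunks = split_at_silences([170.0, 295.0, 410.0], total_s=500.0)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transcribe_chunk, enumerate(chunks)))
transcript = " ".join(text for _, text in sorted(results))
```

Keeping the chunk index alongside each result is what lets the concurrently transcribed segments be reassembled in their original order.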
Quick Start & Requirements
Install from PyPI:

pip install qwen3-asr-toolkit

Set your DashScope API key in the DASHSCOPE_API_KEY environment variable, then run:

qwen3-asr -i <input_file_or_url> [options]
Maintenance & Community
Contributions are welcome via pull requests and issues. The README does not specify maintainers, community channels (such as Discord or Slack), or a public roadmap.
Licensing & Compatibility
The project is licensed under the MIT License, which permits broad usage, including commercial applications and linking within closed-source projects.
Limitations & Caveats
The toolkit requires FFmpeg to be installed separately. Transcription accuracy and efficiency may depend on the quality of the audio input and the effectiveness of the VAD and post-processing algorithms. Usage of the DashScope API may incur costs.
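The README does not describe the post-processing heuristics whose effectiveness is noted above. As a hedged illustration only, one simple approach to removing ASR repetitions is collapsing phrases repeated back-to-back, a common hallucination pattern; `collapse_repeats` is a hypothetical helper, not the toolkit's API.

```python
import re

def collapse_repeats(text):
    # Collapse a phrase (1-5 words) repeated back-to-back, e.g. a stuck
    # "thank you thank you thank you" -> "thank you".
    pattern = re.compile(r"\b(\w+(?:\s+\w+){0,4})(?:\s+\1\b)+", re.IGNORECASE)
    return pattern.sub(r"\1", text)

print(collapse_repeats("thank you thank you thank you for watching"))
# -> "thank you for watching"
```

Real post-processing would likely be more conservative, since legitimate speech can contain repeated words.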