Qwen3-ASR-Toolkit  by QwenLM

High-throughput ASR toolkit for long audio and video

Created 4 weeks ago

624 stars

Top 53.1% on SourcePulse

Project Summary

This toolkit works around the Qwen-ASR API's 3-minute limit on audio duration, enabling high-throughput transcription of long audio and video files. It is aimed at users who need to process large amounts of media efficiently, and it combines intelligent splitting with parallel processing to reduce turnaround time.

How It Works

The toolkit runs a pipeline that first loads media files or URLs. It then uses Voice Activity Detection (VAD) to split the audio into manageable chunks at natural silent pauses, so that each segment stays under the 3-minute API limit and speech is not cut mid-sentence or mid-word. The chunks are transcribed concurrently, using multi-threading, through the DashScope Qwen-ASR API. Finally, the transcribed segments are aggregated, re-ordered, and cleaned by a post-processing step that removes common ASR hallucinations and repetitions. Audio is automatically resampled to the required 16 kHz mono format before it is sent to the API.
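The split-then-parallelize flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's actual code: `plan_chunks`, `transcribe_all`, and the `transcribe` callable are hypothetical names, the silence timestamps are assumed to come from some VAD step, and the real DashScope API call is replaced by whatever function the caller passes in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple

MAX_CHUNK_SECONDS = 180.0  # the 3-minute API limit mentioned in the text

Chunk = Tuple[float, float]  # (start_seconds, end_seconds)


def plan_chunks(silences: List[float], total: float,
                max_len: float = MAX_CHUNK_SECONDS) -> List[Chunk]:
    """Greedily cut at the latest silence that keeps each chunk <= max_len.

    `silences` are timestamps (seconds) of pauses, assumed to come from a VAD.
    If no pause falls inside the window, fall back to a hard cut at the limit.
    """
    points = sorted(p for p in silences if 0.0 < p < total)
    chunks: List[Chunk] = []
    start = 0.0
    while total - start > max_len:
        limit = start + max_len
        candidates = [p for p in points if start < p <= limit]
        cut = candidates[-1] if candidates else limit
        chunks.append((start, cut))
        start = cut
    chunks.append((start, total))
    return chunks


def transcribe_all(chunks: List[Chunk],
                   transcribe: Callable[[Chunk], str],
                   workers: int = 4) -> List[str]:
    """Run `transcribe` on every chunk in a thread pool, then restore order.

    `transcribe` is a stand-in for the real API call (hypothetical here).
    Results arrive in completion order, so each one is tagged with its
    chunk index and sorted before being returned.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(transcribe, c): i for i, c in enumerate(chunks)}
        indexed = [(futures[f], f.result()) for f in as_completed(futures)]
    indexed.sort(key=lambda pair: pair[0])
    return [text for _, text in indexed]
```

Tagging each future with its chunk index is what makes the re-ordering step cheap: the chunks can finish in any order, and a single sort restores the original timeline before the texts are joined.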

Quick Start & Requirements

  • Primary install: pip install qwen3-asr-toolkit
  • Prerequisites: Python 3.8+, FFmpeg (installed and available in PATH), and a DashScope API key (preferably set as the DASHSCOPE_API_KEY environment variable).
  • Usage: qwen3-asr -i <input_file_or_url> [options]
  • Links: Installation and usage examples are provided within the README.
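
Before a first run, the prerequisites listed above can be checked with a short script. The following is a sketch only: `missing_prerequisites` is a hypothetical helper, not part of the toolkit, and it checks nothing beyond the three requirements named in the list (Python version, FFmpeg in PATH, and the DASHSCOPE_API_KEY variable).

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional


def missing_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed.

    `env` defaults to the real process environment; passing a custom mapping
    makes the check easy to exercise in isolation.
    """
    env = os.environ if env is None else env
    problems: List[str] = []

    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ is required")

    # shutil.which searches the given PATH string for an executable.
    if shutil.which("ffmpeg", path=env.get("PATH", "")) is None:
        problems.append("ffmpeg not found in PATH")

    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```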

Highlighted Details

  • Works around the official Qwen-ASR API's 3-minute audio length limit.
  • Features smart audio splitting using VAD for natural chunking.
  • Achieves high-speed transcription through parallel API calls.
  • Includes intelligent post-processing to clean transcripts and remove artifacts.
  • Supports universal media formats via FFmpeg and handles automatic audio resampling.
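
Two of the points above, resampling through FFmpeg and cleaning up repeated output, can be illustrated with small helpers. Both are assumptions made for illustration: `resample_command` and `collapse_repeats` are hypothetical names, the FFmpeg flags shown are the standard ones for sample rate and channel count rather than the toolkit's confirmed invocation, and the repetition filter is a deliberately simple word-level heuristic that the real post-processing may not resemble.

```python
from typing import List


def resample_command(src: str, dst: str) -> List[str]:
    """Build an FFmpeg argument list that converts `src` to 16 kHz mono audio.

    -y   overwrite the output file if it exists
    -vn  drop any video stream
    -ar  output sample rate in Hz
    -ac  number of output audio channels
    """
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ar", "16000", "-ac", "1", dst]


def collapse_repeats(text: str, max_repeats: int = 2) -> str:
    """Keep at most `max_repeats` consecutive copies of the same word.

    A crude stand-in for hallucination cleanup: ASR models sometimes loop
    on one token, and this trims such runs (case-insensitively).
    """
    kept: List[str] = []
    run = 0
    for word in text.split():
        if kept and word.lower() == kept[-1].lower():
            run += 1
        else:
            run = 1
        if run <= max_repeats:
            kept.append(word)
    return " ".join(kept)
```

The argument list can be passed to `subprocess.run(..., check=True)` on a machine where FFmpeg is installed. Building it as a list rather than a shell string avoids quoting problems with file names that contain spaces.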

Maintenance & Community

Contributions are welcome via pull requests and issues. The README does not name maintainers, list community channels (such as Discord or Slack), or publish a roadmap.

Licensing & Compatibility

The project is licensed under the MIT License, which permits broad usage, including commercial applications and linking within closed-source projects.

Limitations & Caveats

The toolkit requires FFmpeg to be installed separately. Transcription accuracy and efficiency may depend on the quality of the audio input and the effectiveness of the VAD and post-processing algorithms. Usage of the DashScope API may incur costs.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 12
  • Star history: 632 stars in the last 29 days