Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat

Created 1 year ago
1,791 stars

Top 24.0% on SourcePulse

Project Summary

Qwen-Audio is a foundational multimodal large language model for universal audio understanding. It accepts diverse audio (speech, natural sound, music) together with text as input and produces text as output. It targets researchers and developers who want a single, versatile audio model, and it reports state-of-the-art results on several benchmarks without task-specific fine-tuning.

How It Works

Qwen-Audio is built upon the Qwen-7B LLM and Whisper-large-v2 audio encoder. It employs a multi-task learning framework to handle variations in textual labels across datasets, enabling knowledge sharing and improved performance on tasks like speech recognition, audio captioning, and acoustic scene classification. Qwen-Audio-Chat is a fine-tuned version for conversational AI, supporting multi-turn dialogues and audio-oriented interactions.
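
To make the multi-task idea more concrete, here is a minimal sketch of one way a shared decoder can be told which task and label format to produce: a short prefix of task tags placed before the text output. The `TaskSpec` class, the `build_task_prefix` helper, and the exact tag spellings are illustrative assumptions, not names confirmed by this summary; check the paper and the repository for the real token set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training or inference task."""
    task: str                 # e.g. "transcribe", "translate", "caption"
    audio_lang: str           # language heard in the audio, or "unknown"
    text_lang: str            # language of the expected text output
    timestamps: bool = False  # whether word-level timestamps are requested


def build_task_prefix(spec: TaskSpec) -> str:
    """Join the tags into the prefix the decoder is conditioned on.

    The tag strings are placeholders chosen for illustration only.
    """
    parts = [
        "<|startoftranscription|>",
        f"<|{spec.audio_lang}|>",
        f"<|{spec.task}|>",
        f"<|{spec.text_lang}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]
    return "".join(parts)


asr_prefix = build_task_prefix(TaskSpec("transcribe", "en", "en"))
s2tt_prefix = build_task_prefix(TaskSpec("translate", "en", "zh"))
```

Because datasets with different label conventions share most of the prefix and differ only in the tags that matter, the model can pool what the tasks have in common while keeping their output formats apart.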

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (for GPU), FFmpeg.
  • Usage: Examples provided for Hugging Face Transformers and ModelScope.
  • Docs: TUTORIAL.md, FAQ.md
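
As a rough illustration of the Hugging Face Transformers path, the sketch below asks Qwen-Audio-Chat one question about one audio file. The repository id `Qwen/Qwen-Audio-Chat` and the `from_list_format` and `chat` methods are recalled from the upstream README and are supplied by the model's remote code rather than by Transformers itself, so treat them as assumptions and confirm them against TUTORIAL.md. The helper names `build_query_items` and `chat_once` are hypothetical.

```python
from typing import Dict, List

MODEL_ID = "Qwen/Qwen-Audio-Chat"  # assumed Hugging Face repository id


def build_query_items(audio_path: str, prompt: str) -> List[Dict[str, str]]:
    """Describe one audio clip plus one text prompt as a list of items."""
    return [{"audio": audio_path}, {"text": prompt}]


def chat_once(audio_path: str, prompt: str) -> str:
    """Load the chat model and return its reply for a single turn."""
    # Imported lazily so build_query_items works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code=True is needed because the model ships its own code.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", trust_remote_code=True
    ).eval()

    # Assumption: both methods come from the repository's remote code.
    query = tokenizer.from_list_format(build_query_items(audio_path, prompt))
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response


if __name__ == "__main__":
    # "sample.wav" is a placeholder path; the first run downloads the weights.
    print(chat_once("sample.wav", "What does the speaker say?"))
```

For a multi-turn dialogue, the history value returned by the chat call would be passed back in on the next turn instead of `None`.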

Highlighted Details

  • Achieves state-of-the-art (SOTA) results on benchmarks including AISHELL-1, CochlScene, ClothoAQA, and VocalSound.
  • Evaluated on 12 standard audio benchmarks covering tasks such as speech recognition, speech-to-text translation, audio captioning, and acoustic scene classification.
  • Qwen-Audio-Chat enables multi-turn dialogues, audio analysis, sound reasoning, and music appreciation.

Maintenance & Community

  • Checkpoints released on ModelScope and Hugging Face (Nov 30, 2023).
  • Paper available: arXiv:2311.07919.
  • Contact: qianwen_opensource@alibabacloud.com for research/product teams.

Licensing & Compatibility

  • Permissive license allowing free use for research and commercial purposes.

Limitations & Caveats

  • Models perform best with audio clips under 30 seconds.
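
One possible workaround for longer recordings is to cut them into windows of at most 30 seconds and query the model once per window. The sketch below only computes the window boundaries; `split_spans` is a hypothetical helper, and fixed-length cutting is an assumption that may split words or musical phrases, so a silence-aware splitter could work better in practice.

```python
from typing import List, Tuple


def split_spans(duration_s: float, max_len_s: float = 30.0) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) spans in seconds, each at most max_len_s long."""
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_len_s must be > 0")

    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        # The last span is shortened so it never runs past the end of the clip.
        end = min(start + max_len_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


spans = split_spans(70.0)
```

Each span can then be extracted with FFmpeg, which is already listed as a prerequisite, and passed to the model as a separate clip.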

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 28 stars in the last 30 days
