Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat

created 1 year ago
1,751 stars

Top 25.0% on sourcepulse

Project Summary

Qwen-Audio is a multimodal large language model for universal audio understanding. It accepts diverse audio (speech, natural sound, music) together with text and produces text output. It is aimed at researchers and developers who want a single general-purpose audio model, and it reports state-of-the-art results on several benchmarks without task-specific fine-tuning.

How It Works

Qwen-Audio is built on the Qwen-7B LLM and the Whisper-large-v2 audio encoder. It is trained with a multi-task learning framework designed to handle differences in textual labels across datasets, which lets tasks share knowledge and improves performance on tasks such as speech recognition, audio captioning, and acoustic scene classification. Qwen-Audio-Chat is a version fine-tuned for conversation; it supports multi-turn dialogue and audio-centric interaction.
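
As a rough illustration of how a single model can be steered toward different tasks, the sketch below composes a tag-prefixed query for the pretrained Qwen-Audio model. The tag strings and the `<audio>` wrapper follow the transcription example in the upstream README as best recalled, so they should be checked against the repository before use. `build_task_prompt` and its parameters are hypothetical names introduced here, and only the `transcribe` task value is taken from that example.

```python
def build_task_prompt(audio_ref: str, audio_lang: str = "en",
                      task: str = "transcribe", text_lang: str = "en") -> str:
    """Compose a tag-prefixed query string.

    Hypothetical helper, not a library API. The tag layout is an assumption
    based on the upstream README example and may differ between releases.
    """
    tags = [
        "<|startoftranscription|>",  # marks the start of the task prefix
        f"<|{audio_lang}|>",         # language spoken in the audio
        f"<|{task}|>",               # task selector, e.g. transcribe
        f"<|{text_lang}|>",          # language of the text to generate
        "<|notimestamps|>",          # request plain text without timestamps
        "<|wo_itn|>",                # request output without inverse text normalization
    ]
    return f"<audio>{audio_ref}</audio>" + "".join(tags)


if __name__ == "__main__":
    print(build_task_prompt("sample.flac"))
```

The resulting string is only the text side of the input; turning it into model inputs still goes through the tokenizer shipped with the checkpoint.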

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (for GPU), FFmpeg.
  • Usage: Examples provided for Hugging Face Transformers and ModelScope.
  • Docs: TUTORIAL.md, FAQ.md
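
A minimal quick-start sketch for the Hugging Face Transformers path, assuming the `Qwen/Qwen-Audio-Chat` checkpoint and the `from_list_format` / `chat` helpers that the checkpoint's remote code exposes in the upstream README. The function names `check_prerequisites`, `build_query_items`, and `chat_once` are hypothetical and exist only for this illustration; consult TUTORIAL.md for the authoritative usage.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a list of problems found; an empty list means the basics look fine."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg was not found on PATH")
    return problems


def build_query_items(audio_path, prompt):
    """Build the list-of-dicts input shape that the README passes to from_list_format."""
    return [{"audio": audio_path}, {"text": prompt}]


def chat_once(audio_path, prompt, model_id="Qwen/Qwen-Audio-Chat"):
    """Load the chat model and ask one question about one audio file."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    # device_map="auto" assumes the accelerate package is installed.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", trust_remote_code=True
    ).eval()
    # from_list_format and chat come from the checkpoint's remote code
    # (an assumption based on the README), not from transformers itself.
    query = tokenizer.from_list_format(build_query_items(audio_path, prompt))
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

Calling `chat_once("clip.wav", "What is being said?")` downloads the checkpoint on first use, so it is kept out of the `__main__` block here.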

Highlighted Details

  • Reports state-of-the-art (SOTA) results on benchmarks including Aishell1, CochlScene, ClothoAQA, and VocalSound.
  • Evaluated on 12 standard audio benchmarks covering tasks such as speech recognition, speech-to-text translation, audio captioning, and acoustic scene classification.
  • Qwen-Audio-Chat enables multi-turn dialogues, audio analysis, sound reasoning, and music appreciation.

Maintenance & Community

  • Checkpoints released on ModelScope and Hugging Face (Nov 30, 2023).
  • Paper available: arXiv:2311.07919.
  • Contact: qianwen_opensource@alibabacloud.com to reach the research or product teams.

Licensing & Compatibility

  • Permissive license allowing free use for research and commercial purposes.

Limitations & Caveats

  • Models perform best with audio clips under 30 seconds.
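
One way to work within the 30-second guidance is to cut longer recordings into shorter segments before sending them to the model. The sketch below only computes segment boundaries; it is an illustrative assumption, not a preprocessing step documented by the project, and `chunk_spans` is a hypothetical helper. Hard cuts can split words or sound events, so overlapping or silence-aware splitting may work better in practice.

```python
def chunk_spans(duration_s, max_len_s=30.0):
    """Split [0, duration_s) into consecutive (start, end) spans of at most max_len_s seconds."""
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_len_s must be > 0")
    spans = []
    start = 0.0
    while start < duration_s:
        # The final span is shortened so it never runs past the end of the audio.
        end = min(start + max_len_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    for start, end in chunk_spans(70.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each span can then be extracted with an audio tool such as FFmpeg and passed to the model as a separate clip. Pass a smaller `max_len_s` to keep every segment strictly under 30 seconds.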

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 79 stars in the last 90 days
