Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat

Created 2 years ago

1,873 stars

Top 22.8% on SourcePulse

3 Experts Love This Project

JustinLin610

Core Maintainer at Alibaba Qwen

jinze1994

Research Scientist at Alibaba Qwen

huybery

Research Scientist at Alibaba Qwen

Project Summary

Qwen-Audio is a foundational multimodal large language model for universal audio understanding. It accepts diverse audio (speech, sound, music) together with text as input and generates text as output. It targets researchers and developers who want a single, versatile model for audio processing, and it offers state-of-the-art performance on multiple benchmarks without task-specific fine-tuning.

How It Works

Qwen-Audio is built on the Qwen-7B LLM and the Whisper-large-v2 audio encoder. It uses a multi-task learning framework designed to handle the differences in textual labels across datasets, which enables knowledge sharing and improves performance on tasks such as speech recognition, audio captioning, and acoustic scene classification. Qwen-Audio-Chat is a fine-tuned variant for conversational use that supports multi-turn dialogue and audio-oriented interaction.

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (for GPU), FFmpeg.
Usage: Examples provided for Hugging Face Transformers and ModelScope.
Docs: TUTORIAL.md, FAQ.md
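
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the project: the helper names `parse_version` and `check_environment` are hypothetical, and the script only covers the Python, PyTorch, and FFmpeg requirements (it does not inspect the CUDA toolkit version).

```python
import shutil
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    # Drop a local build suffix such as "+cu118" before splitting on dots.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment():
    """Map each checked prerequisite to True (satisfied) or False (not satisfied)."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    try:
        report["torch>=1.12"] = parse_version(metadata.version("torch")) >= (1, 12)
    except metadata.PackageNotFoundError:
        # PyTorch is not installed in this environment.
        report["torch>=1.12"] = False
    report["ffmpeg on PATH"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Running the script prints one line per prerequisite, so a missing FFmpeg binary or an outdated PyTorch shows up before the model download starts.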

Highlighted Details

Achieves state-of-the-art (SOTA) results on benchmarks including AISHELL-1, CochlScene, ClothoAQA, and VocalSound.
Covers 12 standard audio benchmarks spanning tasks such as speech recognition, speech-to-text translation, audio captioning, and acoustic scene classification.
Qwen-Audio-Chat enables multi-turn dialogues, audio analysis, sound reasoning, and music appreciation.
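
A multi-turn exchange with Qwen-Audio-Chat through Hugging Face Transformers might look like the sketch below. It rests on several assumptions that should be verified against TUTORIAL.md: the checkpoint ID `Qwen/Qwen-Audio-Chat` is not stated on this page, and `from_list_format` and `chat` are methods supplied by the model's remote code (hence `trust_remote_code=True`), not by Transformers itself. The helper names `build_audio_query`, `load_chat_model`, and `two_turn_chat` are hypothetical.

```python
def build_audio_query(audio, text):
    """Build a list-format query: one audio entry (local path or URL), then one text entry."""
    return [{"audio": audio}, {"text": text}]


def load_chat_model(model_id="Qwen/Qwen-Audio-Chat", device_map="cuda"):
    """Load tokenizer and model. The model ID is an assumption, not taken from this page."""
    # Imported lazily so build_audio_query works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map=device_map, trust_remote_code=True
    ).eval()
    return tokenizer, model


def two_turn_chat(audio, first_question, follow_up):
    """Ask one question about an audio clip, then a follow-up that reuses the history."""
    tokenizer, model = load_chat_model()
    # from_list_format and chat come from the checkpoint's remote code (assumed API).
    query = tokenizer.from_list_format(build_audio_query(audio, first_question))
    first_answer, history = model.chat(tokenizer, query=query, history=None)
    second_answer, history = model.chat(tokenizer, follow_up, history=history)
    return first_answer, second_answer
```

Passing the returned `history` back into the second call is what makes the follow-up question refer to the same audio clip without re-sending it.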

Maintenance & Community

Checkpoints released on ModelScope and Hugging Face (Nov 30, 2023).
Paper available: arXiv:2311.07919.
Contact: qianwen_opensource@alibabacloud.com for research/product teams.

Licensing & Compatibility

Permissive license allowing free use for research and commercial purposes.

Limitations & Caveats

Models perform best with audio clips under 30 seconds.
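
One way to work within this limit is to split longer recordings into windows of at most 30 seconds before sending them to the model. The sketch below only computes the window boundaries. The function name `chunk_spans` is hypothetical, and fixed-length splitting is an assumption made for illustration, not a method recommended by the project; it can cut through words or sound events at the boundaries.

```python
def chunk_spans(duration_s, max_len_s=30.0):
    """Return (start, end) pairs in seconds that cover [0, duration_s] in order.

    Every span is at most max_len_s long; only the last one may be shorter.
    """
    duration_s = float(duration_s)
    max_len_s = float(max_len_s)
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_len_s must be > 0")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # A 75-second recording yields two full windows and one 15-second remainder.
    print(chunk_spans(75))
```

Each span can then be cut out with an audio tool such as FFmpeg and transcribed or queried separately.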

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

11 stars in the last 30 days

Explore Similar Projects

UniAudio2 by yangdongchao

Audio foundation model unifies speech, sound, and music processing

Created 3 weeks ago

Updated 1 week ago

OSUM by ASLP-lab

Open Speech Understanding Model research paper

Created 1 year ago

Updated 3 months ago

UniAudio by yangdongchao

Audio foundation model for universal audio generation

Created 2 years ago

Updated 1 year ago

ltu by YuanGongND

Audio/speech LLM for perception and understanding, supporting open-ended questions

Created 2 years ago

Updated 1 year ago

awesome-large-audio-models by EmulationAI

Curated list of Large Language Models in Audio AI

Created 2 years ago

Updated 4 months ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

LLaSM by LinkSoul-AI

Open-source speech-language assistant for multimodal conversation

Created 2 years ago

Updated 2 years ago

Starred by Tobi Lutke (Cofounder of Shopify), Luis Capelo (Cofounder of Lightning AI), and 1 more.

vui by fluxions-ai

Conversational speech models for on-device use

Created 8 months ago

Updated 2 weeks ago

Starred by Pawel Garbacki (Cofounder of Fireworks AI).

Step-Audio2 by stepfun-ai

End-to-end audio understanding and speech conversation model

Created 7 months ago

Updated 1 week ago

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

SALMONN by bytedance

Multimodal LLM for speech, audio events, and music inputs

Created 2 years ago

Updated 3 weeks ago

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen).

Qwen2-Audio by QwenLM

Audio-language model for audio analysis and voice chat

Created 1 year ago

Updated 10 months ago

Starred by Yaowei Zheng (Author of LLaMA-Factory) and Omar Sanseviero (DevRel at Google DeepMind).

mini-omni by gpt-omni

Open-source multimodal LLM for real-time speech interaction

Created 1 year ago

Updated 1 year ago

Starred by Luis Capelo (Cofounder of Lightning AI) and Benjamin Bolte (Cofounder of K-Scale Labs).

Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

Created 10 months ago

Updated 8 months ago
