Qwen2-Audio by QwenLM

Audio-language model for audio analysis and voice chat

Created 1 year ago

2,058 stars

Top 21.2% on SourcePulse

1 Expert Loves This Project

JustinLin610

Core Maintainer at Alibaba Qwen

Project Summary

Qwen2-Audio is an open-source large audio-language model from Alibaba Cloud, designed for versatile audio understanding and interaction. It supports both voice chat, enabling free-form spoken conversations, and audio analysis, where users can provide audio with text instructions for tasks like sound identification or speech translation. The models are suitable for researchers and developers working with audio data who need advanced speech and sound processing capabilities.

How It Works

Qwen2-Audio employs a three-stage training process, integrating audio and language understanding into a unified architecture. This approach allows it to process various audio signals and respond to speech instructions directly or perform detailed audio analysis based on textual prompts. The model is optimized for handling diverse audio inputs and generating relevant textual outputs.

Quick Start & Requirements

Installation: pip install git+https://github.com/huggingface/transformers is recommended to ensure compatibility.
Dependencies: Requires transformers, librosa, and potentially torch with CUDA support for GPU acceleration.
Usage: Examples provided for voice chat, audio analysis, and batch inference using Hugging Face Transformers.
Resources: Models perform best with audio clips under 30 seconds. GPU acceleration is implied for efficient inference.
Documentation: Links to Hugging Face models, demos, and technical reports are available.

Highlighted Details

Offers two models: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.
Evaluated on 13 standard benchmarks covering ASR, S2TT, SER, VSC, and various AIR-Bench tasks.
Provides comprehensive evaluation scripts for result reproduction.
Supports both voice chat and audio analysis interaction modes.

Maintenance & Community

Official releases on ModelScope and Hugging Face.
Technical reports and blog posts detailing progress and capabilities.
Contact information for research and product teams provided.

Licensing & Compatibility

License details are available within each model's Hugging Face repository.
Commercial usage does not require explicit requests.

Limitations & Caveats

The README notes potential score fluctuations after framework conversion to Hugging Face, recommending the use of initial model results from the paper for precise comparisons.
Optimal performance is noted for audio clips under 30 seconds.

Health Check

Last Commit

10 months ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

0

Star History

26 stars in the last 30 days

Explore Similar Projects

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 5 months ago

alibabacloud-bailian-speech-demo by aliyun

Speech AI SDK demos for AlibabaCloud Bailian

Created 1 year ago

Updated 2 months ago

Starred by

Omar Sanseviero

Omar Sanseviero(DevRel at Google DeepMind).

LLaSM by LinkSoul-AI

Open-source speech-language assistant for multimodal conversation

Created 2 years ago

Updated 2 years ago

Starred by

Tobi Lutke

Tobi Lutke(Cofounder of Shopify),

Luis Capelo

Luis Capelo(Cofounder of Lightning AI), and

1 more.

vui by fluxions-ai

Conversational speech models for on-device use

Created 8 months ago

Updated 2 weeks ago

Starred by

Jeff Hammerbacher

Jeff Hammerbacher(Cofounder of Cloudera),

Theo Browne

Theo Browne(Founder of Ping.gg), and

1 more.

dia2 by nari-labs

Streaming dialogue TTS for real-time conversational audio

Created 3 months ago

Updated 2 months ago

fast-voice-assistant by dsa

AI voice assistant demo with <500ms response

Created 1 year ago

Updated 1 year ago

Starred by

Teknium

Teknium(Cofounder of Nous Research).

ChatWaifu by cjyaddone

Chatbot for simulating conversations with waifu-style characters

Created 3 years ago

Updated 1 year ago

Babagaboosh by DougDougGithub

Simple app for verbal conversation with GPT-4o

Created 2 years ago

Updated 1 year ago

Starred by

Junyang Lin

Junyang Lin(Core Maintainer at Alibaba Qwen),

Jinze Bai

Jinze Bai(Research Scientist at Alibaba Qwen), and

1 more.

Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat

Created 2 years ago

Updated 1 year ago

ASR-LLM-TTS by ABexit

Speech interaction system integrating ASR, LLM, and TTS

Created 1 year ago

Updated 1 year ago

Starred by

Luis Capelo

Luis Capelo(Cofounder of Lightning AI) and

Benjamin Bolte

Benjamin Bolte(Cofounder of K-Scale Labs).

Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

Created 10 months ago

Updated 8 months ago

sherpa-onnx by k2-fsa

Speech toolkit for local, offline speech AI tasks via ONNX

Created 3 years ago

Updated 1 day ago

Feedback? Help us improve.