Audio-Interaction  by xzf-thu

Unified model for real-time, always-on audio interaction

Created 3 weeks ago


428 stars

Top 68.5% on SourcePulse

Project Summary

This project introduces the Audio Interaction Model (AIM), which addresses a limitation of current Large Audio Language Models (LALMs): most are either offline or single-task. AIM is a unified, always-on model that handles offline tasks, real-time streaming, and general instruction following. For developers, this means continuous, proactive, context-aware audio processing within a single system.

How It Works

AIM operates as a unified, always-on model that continuously processes audio frames and decides when to speak. It remains in a ⟨Silent⟩ state until the task or acoustic context calls for a response, then transitions to ⟨Speak⟩. This design integrates automatic speech recognition (ASR), speech-to-text translation (S2TT), and audio question answering (AQA) into a single, proactive perceive-decide-respond loop, moving beyond single-task or offline paradigms.
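
The perceive-decide-respond loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's code: the names State, Decision, interaction_loop, and toy_policy are hypothetical, and the toy policy (speak when a frame contains a marker byte) merely stands in for the model's learned decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Iterator, List, Optional


class State(Enum):
    SILENT = "Silent"
    SPEAK = "Speak"


@dataclass
class Decision:
    state: State
    text: Optional[str] = None  # only meaningful when state is SPEAK


# Hypothetical policy signature: (current frame, all frames so far) -> Decision.
Policy = Callable[[bytes, List[bytes]], Decision]


def interaction_loop(frames: Iterable[bytes], policy: Policy) -> Iterator[str]:
    """Consume frames one at a time and yield text whenever the policy speaks."""
    history: List[bytes] = []
    for frame in frames:
        history.append(frame)               # perceive
        decision = policy(frame, history)   # decide
        if decision.state is State.SPEAK and decision.text:
            yield decision.text             # respond
        # Otherwise stay silent and keep listening.


def toy_policy(frame: bytes, history: List[bytes]) -> Decision:
    """Stand-in for the model: speak only when a frame contains b'!'."""
    if b"!" in frame:
        return Decision(State.SPEAK, f"event detected at frame {len(history)}")
    return Decision(State.SILENT)
```

Calling list(interaction_loop([b"..", b"a!", b".."], toy_policy)) produces one response, for the second frame, and stays silent for the other two. The point of the sketch is that silence is an explicit output of the decision step rather than the absence of a request.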

Quick Start & Requirements

Installation requires cloning the repo, creating a Python 3.12 Conda environment, and running pip install -r requirements.txt. Prerequisites are PyTorch with CUDA support and ffmpeg. Model weights can be downloaded with python download.py. Inference runs either offline (infer_offline.py) or in real time (infer_online.py), and a WebUI demo is served by web/server.py. The README links to the technical reports and demos.
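
Before installing, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch using only the Python standard library; check_prerequisites is a hypothetical helper, not part of the repository, and it only checks that a torch package is importable, not that it was built with CUDA.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which prerequisites named in the Quick Start appear to be present."""
    return {
        # The summary specifies a Python 3.12 environment.
        "python_3_12": sys.version_info[:2] == (3, 12),
        # ffmpeg must be discoverable on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # Only checks that torch can be found, without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Once PyTorch is installed, torch.cuda.is_available() reports whether a CUDA device is usable.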

Highlighted Details

  • Unified Streaming Model: Integrates offline and real-time audio tasks (ASR, S2TT, voice chatting, instruction following) into a single, always-on architecture.
  • Proactive Intervention: Detects critical audio events and issues warnings without explicit prompts.
  • Real-time Interaction: Delivers low-latency, incrementally corrected partial transcripts and translations.
  • Full-Spectrum Perception: Jointly processes speech, music, and environmental sounds for context-aware conversations.
  • StreamAudio-2M Dataset: Introduces a large-scale (~2.6M items) streaming instruction-following corpus.
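
To illustrate what "incrementally corrected partial transcripts" can mean for a client, here is a small sketch that applies each new hypothesis as a correction to the previous one. It assumes the client receives the full current hypothesis as a string on every update; that format, and the PartialTranscript class itself, are assumptions for illustration, not the project's interface.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading words shared by both word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PartialTranscript:
    """Tracks the latest hypothesis and reports what changed on each update."""

    def __init__(self) -> None:
        self.words: list[str] = []

    def update(self, hypothesis: str) -> tuple[int, list[str]]:
        """Replace the current hypothesis.

        Returns (retracted, appended): how many trailing words of the old
        hypothesis were withdrawn, and which words were added after the
        shared prefix.
        """
        new_words = hypothesis.split()
        keep = common_prefix_len(self.words, new_words)
        retracted = len(self.words) - keep
        appended = new_words[keep:]
        self.words = new_words
        return retracted, appended
```

For example, updating from "turn on the light" to "turn on the lights now" retracts one word and appends two, so a display only needs to redraw the tail of the line.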

Maintenance & Community

The project was released recently (May 2026). The README gives no explicit details on maintainers, community channels, sponsorships, or a public roadmap.

Licensing & Compatibility

Released under the Apache-2.0 License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Given its recent release, the project may be experimental. Specific hardware requirements (e.g., VRAM) are not detailed. Fine-tuning requires specific checkpoints (QWEN_OMNI_CKPT, AUDIO_TOWER_CKPT), which may need to be obtained separately. Performance on edge cases is not exhaustively documented.
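
A pre-flight check for the fine-tuning checkpoints might look like the sketch below. It assumes QWEN_OMNI_CKPT and AUDIO_TOWER_CKPT are environment variables holding filesystem paths; the summary does not say how they are supplied, so treat that as an assumption. The helper missing_checkpoints is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping

# Names taken from the summary; treating them as environment variables
# that hold paths is an assumption made for this sketch.
REQUIRED = ("QWEN_OMNI_CKPT", "AUDIO_TOWER_CKPT")


def missing_checkpoints(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a human-readable problem for each checkpoint that is unusable."""
    problems = []
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems
```

An empty list means both variables are set and point to existing paths; it says nothing about whether the files are the right checkpoints.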

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

428 stars in the last 27 days
