Audio-Interaction by xzf-thu

Unified model for real-time, always-on audio interaction

Created 3 weeks ago

428 stars

Top 68.5% on SourcePulse

Project Summary

This project introduces the Audio Interaction Model (AIM), which addresses a limitation of current Large Audio Language Models (LALMs): most operate offline or handle only a single task. AIM is a unified, always-on model that covers offline tasks, real-time streaming, and general instruction following. For developers, this means continuous, proactive, context-aware audio processing within a single system.

How It Works

AIM runs as a unified, always-on model that processes audio frames continuously and decides for itself when to speak. It holds a ⟨Silent⟩ state and transitions to ⟨Speak⟩ when the task or the acoustic context calls for a response. This design folds automatic speech recognition (ASR), speech-to-text translation (S2TT), and audio question answering (AQA) into a single proactive perceive-decide-respond loop, rather than treating them as separate single-task or offline pipelines.
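The perceive-decide-respond loop described above can be sketched as a small state machine. This is an illustration only, not the project's implementation: the `InteractionLoop` class, the `decide` callback, and the string-labelled frames are all hypothetical stand-ins for the model and its audio input.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, Iterator, List, Optional


class State(Enum):
    SILENT = "⟨Silent⟩"
    SPEAK = "⟨Speak⟩"


@dataclass
class InteractionLoop:
    # Hypothetical policy standing in for the model: given all frames seen
    # so far, return a response string, or None to stay silent.
    decide: Callable[[List[str]], Optional[str]]
    context: List[str] = field(default_factory=list)
    state: State = State.SILENT

    def step(self, frame: str) -> Optional[str]:
        self.context.append(frame)             # perceive
        response = self.decide(self.context)   # decide
        self.state = State.SPEAK if response is not None else State.SILENT
        return response                        # respond (None means silent)

    def run(self, frames: Iterable[str]) -> Iterator[str]:
        for frame in frames:
            response = self.step(frame)
            if response is not None:
                yield response


def alarm_policy(context: List[str]) -> Optional[str]:
    # Toy rule: speak only when the newest frame is labelled "alarm".
    return "Warning: alarm detected" if context[-1] == "alarm" else None


loop = InteractionLoop(decide=alarm_policy)
outputs = list(loop.run(["speech", "music", "alarm", "speech"]))
print(outputs)
```

The loop sees every frame but emits output for only one of them, which is the behavior the ⟨Silent⟩ and ⟨Speak⟩ states are meant to capture.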

Quick Start & Requirements

Installation involves cloning the repository, creating a Python 3.12 Conda environment, and running pip install -r requirements.txt. PyTorch with CUDA support and ffmpeg are prerequisites. Model weights are downloaded by running python download.py. Inference runs either offline (infer_offline.py) or in real time (infer_online.py), and a WebUI demo is served by web/server.py. The README links to the technical reports and demos.
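Before installing, it can help to confirm that the stated prerequisites are present. The following is a minimal sketch of such a check; `check_prerequisites` is a hypothetical helper and is not part of the repository. Treating 3.12 as a minimum version is an assumption, since the summary only names Python 3.12.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 12)):
    """Report which of the listed prerequisites are available locally."""
    report = {
        # Assumption: 3.12 is treated as a minimum, not an exact pin.
        "python_ok": sys.version_info[:2] >= min_python,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is reported as "no CUDA".
            report["cuda_available"] = False
    return report


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Any entry reported as missing should be resolved before running download.py or the inference scripts.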

Highlighted Details

Unified Streaming Model: Integrates offline and real-time audio tasks (ASR, S2TT, voice chatting, instruction following) into a single, always-on architecture.
Proactive Intervention: Detects critical audio events and issues warnings without explicit prompts.
Real-time Interaction: Delivers low-latency, incrementally corrected partial transcripts and translations.
Full-Spectrum Perception: Jointly processes speech, music, and environmental sounds for context-aware conversations.
StreamAudio-2M Dataset: Introduces a large-scale (~2.6M items) streaming instruction-following corpus.
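To illustrate what "incrementally corrected" partial output means for a client, here is a minimal sketch of how a display could be updated as revised hypotheses arrive. It assumes each update is a full replacement string for the current utterance; the summary does not describe the model's actual output format, and `apply_partial` is a hypothetical helper.

```python
def common_prefix_len(a, b):
    """Number of leading items on which the two lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def apply_partial(displayed, new_partial):
    """Split an update into (kept, removed, added) word lists.

    Assumption: new_partial is a complete revised hypothesis, not a delta.
    """
    new_words = new_partial.split()
    keep = common_prefix_len(displayed, new_words)
    return displayed[:keep], displayed[keep:], new_words[keep:]


displayed = []
for partial in ["I scream", "ice cream is", "ice cream is good"]:
    kept, removed, added = apply_partial(displayed, partial)
    displayed = kept + added
    print(f"removed={removed} added={added}")
```

Only the words after the shared prefix are redrawn, so an early misrecognition can be corrected without flickering the part of the transcript that has not changed.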

Maintenance & Community

The project was released recently (May 2026). The README gives no explicit details on maintainers, community channels, sponsorships, or a public roadmap.

Licensing & Compatibility

Released under the Apache-2.0 License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Given its recent release, the project may be experimental. Hardware requirements such as VRAM are not specified. Fine-tuning requires specific checkpoints (QWEN_OMNI_CKPT, AUDIO_TOWER_CKPT) that may need to be obtained separately. Behavior on edge cases is not documented in detail.
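Because the fine-tuning checkpoints may have to be obtained separately, a quick check before launching a run can save time. The sketch below assumes QWEN_OMNI_CKPT and AUDIO_TOWER_CKPT are read as environment variables holding filesystem paths; the summary does not confirm how the project consumes them, and `missing_checkpoints` is a hypothetical helper.

```python
import os
from pathlib import Path

# Names taken from the summary; treating them as environment variables
# that hold local paths is an assumption.
REQUIRED_CHECKPOINTS = ("QWEN_OMNI_CKPT", "AUDIO_TOWER_CKPT")


def missing_checkpoints(env=None):
    """Return a dict mapping each problematic variable to a reason."""
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_CHECKPOINTS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).exists():
            problems[name] = f"path does not exist: {value}"
    return problems


problems = missing_checkpoints()
if problems:
    for name, reason in problems.items():
        print(f"{name}: {reason}")
else:
    print("All fine-tuning checkpoints found.")
```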
