Audio-Omni by ZeyueT

Unified multimodal audio framework for generation and editing

Created 3 months ago

394 stars

Top 72.8% on SourcePulse

Project Summary

Summary

Audio-Omni is an end-to-end framework that unifies audio understanding, generation, and editing across general sound, music, and speech. It addresses the fragmentation of current audio AI tooling by handling these tasks with a single model. It targets researchers and developers who want to drive audio generation and editing through natural-language instructions.

How It Works

The framework integrates a frozen Qwen2.5-Omni Multimodal Large Language Model (MLLM) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity audio synthesis. This approach leverages the emergent reasoning abilities of the MLLM to interpret complex natural language instructions for audio manipulation. To overcome data scarcity in audio editing, a large-scale, high-quality dataset was specifically constructed.

Quick Start & Requirements

Prerequisites: Python 3.11+, CUDA-capable GPU, FFmpeg, and libsndfile.
Installation: Clone the repository, create and activate a Conda environment (conda create -n audio-omni python=3.11 -y, conda activate audio-omni), install the package (pip install -e .), and install FFmpeg/libsndfile (conda install -c conda-forge ffmpeg libsndfile). Optional: pip install flash-attn for faster attention.
Model Download: Use huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/ or Python's snapshot_download.
Demo: Either run the bundled script (bash infer_demo.sh) or launch the Gradio demo with CUDA_VISIBLE_DEVICES=0 python3 run_gradio.py --model-config model/Audio-Omni.json --ckpt-path model/model.ckpt --server-port 7777. With that port setting, the Gradio demo is accessible at http://localhost:7777.
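
The Quick Start steps can also be scripted. The sketch below is illustrative only and is not part of the repository: the helper names (missing_prerequisites, download_weights, gradio_command) are hypothetical, and download_weights assumes the huggingface_hub package is installed. It checks only what can be checked from the standard library (Python version and FFmpeg on PATH); libsndfile and CUDA availability are not verified.

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 11), tools=("ffmpeg",)):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper. libsndfile is a shared library, not an executable,
    so it cannot be detected with shutil.which and is not checked here.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems


def download_weights(local_dir="model"):
    """Fetch the released files from the Hugging Face Hub into local_dir.

    Python counterpart of:
    huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
    """
    # Imported lazily so the other helpers work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id="HKUSTAudio/Audio-Omni", local_dir=local_dir)


def gradio_command(model_dir="model", port=7777):
    """Build the argument list for the Gradio demo command from the Quick Start."""
    return [
        "python3", "run_gradio.py",
        "--model-config", f"{model_dir}/Audio-Omni.json",
        "--ckpt-path", f"{model_dir}/model.ckpt",
        "--server-port", str(port),
    ]
```

A typical flow would call missing_prerequisites() first, then download_weights(), and finally pass gradio_command() to subprocess.run from the repository root. GPU selection stays outside the argument list, since the Quick Start sets it through the CUDA_VISIBLE_DEVICES environment variable.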

Highlighted Details

Unified Capabilities: Supports audio understanding, text-to-audio (T2A), text-to-music (T2M), video-to-audio (V2A), video-to-music (V2M), text-to-speech (TTS) with voice cloning, and voice conversion (VC).
Advanced Editing: Supports adding, removing, and extracting sounds, as well as style transfer, all controlled via text prompts.
Multimodal Input: Can process text, audio, and video inputs for understanding tasks, and uses voice prompts for TTS and VC.
Emergent Abilities: Inherits advanced reasoning and manipulation capabilities from the underlying Qwen2.5-Omni MLLM.

Maintenance & Community

Contact: Zeyue Tian (ztianad@connect.ust.hk) for inquiries.
Acknowledgments: The project acknowledges contributions from AudioX, VidMuse, MMAudio, F5-TTS, and stable-audio-tools.
Community: No explicit community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

License: The code is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
Restrictions: Model weights are strictly for research use. Commercial applications require explicit authorization from the authors.

Limitations & Caveats

Model weights are restricted to non-commercial research use; any commercial use requires prior authorization from the project authors.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive

Star History: 6 stars in the last 30 days