ZeyueT
Unified multimodal audio framework for generation and editing
Top 74.4% on SourcePulse
Summary
Audio-Omni is an end-to-end framework that unifies audio understanding, generation, and editing across general sound, music, and speech. It addresses the fragmentation of current audio AI capabilities by offering a single model for these tasks. Researchers and developers can use it to generate and edit audio from natural-language instructions, with a language model handling the reasoning and a diffusion model handling the synthesis.
How It Works
The framework integrates a frozen Qwen2.5-Omni Multimodal Large Language Model (MLLM) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity audio synthesis. This approach leverages the emergent reasoning abilities of the MLLM to interpret complex natural language instructions for audio manipulation. To overcome data scarcity in audio editing, a large-scale, high-quality dataset was specifically constructed.
Quick Start & Requirements
Install: create and activate a conda environment (conda create -n audio-omni python=3.11 -y, conda activate audio-omni), install the package (pip install -e .), and install FFmpeg/libsndfile (conda install -c conda-forge ffmpeg libsndfile). Optional: pip install flash-attn for faster attention.
Download weights: huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/ or Python's snapshot_download.
Run the demo: bash infer_demo.sh or CUDA_VISIBLE_DEVICES=0 python3 run_gradio.py --model-config model/Audio-Omni.json --ckpt-path model/model.ckpt --server-port 7777. The Gradio demo is accessible at http://localhost:7777.
Highlighted Details
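The quick-start commands above (weight download and Gradio launch) can also be scripted from Python. The following is a minimal sketch, not the project's own tooling: the helper names download_weights, gradio_command, and launch_demo are hypothetical, and it assumes the huggingface_hub package is installed and that the script runs from the repository root, where run_gradio.py lives. The repository ID, file paths, and port are taken from the commands above.

```python
import os
import subprocess


def download_weights(local_dir: str = "model") -> str:
    """Fetch the checkpoint repo named in the quick start into local_dir."""
    # Imported lazily so the command builder below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id="HKUSTAudio/Audio-Omni", local_dir=local_dir)


def gradio_command(model_dir: str = "model", port: int = 7777) -> list[str]:
    """Build the argv for the Gradio demo, mirroring the quick-start command."""
    return [
        "python3", "run_gradio.py",
        "--model-config", f"{model_dir}/Audio-Omni.json",
        "--ckpt-path", f"{model_dir}/model.ckpt",
        "--server-port", str(port),
    ]


def launch_demo(gpu: str = "0", model_dir: str = "model", port: int = 7777) -> None:
    """Run the demo on one GPU; blocks until the server exits."""
    # Equivalent to prefixing the shell command with CUDA_VISIBLE_DEVICES=<gpu>.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    subprocess.run(gradio_command(model_dir, port), env=env, check=True)

# Typical use (requires network access and the repository checkout):
#   download_weights()
#   launch_demo()
```

Keeping the command as an argument list rather than a shell string avoids quoting problems if the model directory contains spaces.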
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Model weights are restricted to non-commercial, research-only applications. Any commercial use necessitates obtaining prior authorization from the project authors.