Audio-Omni by ZeyueT

Unified multimodal audio framework for generation and editing

Created 2 months ago
383 stars

Top 74.4% on SourcePulse

Summary

Audio-Omni is an end-to-end framework that unifies audio understanding, generation, and editing across general sound, music, and speech. Where current audio AI is fragmented across task-specific systems, it offers a single model for all of these tasks. Researchers and developers can drive generation and editing with natural-language instructions; the architecture pairs a frozen multimodal LLM with a trainable diffusion transformer.

How It Works

The framework pairs a frozen Qwen2.5-Omni Multimodal Large Language Model (MLLM), used for high-level reasoning, with a trainable Diffusion Transformer used for high-fidelity audio synthesis. The MLLM's emergent reasoning abilities let it interpret complex natural-language instructions for audio manipulation. To overcome the scarcity of audio-editing data, the authors constructed a large-scale, high-quality dataset specifically for this purpose.

Quick Start & Requirements

  • Prerequisites: Python 3.11+, CUDA-capable GPU, FFmpeg, and libsndfile.
  • Installation: Clone the repository, then create and activate a Conda environment (conda create -n audio-omni python=3.11 -y, then conda activate audio-omni). Install the package with pip install -e . and the system libraries with conda install -c conda-forge ffmpeg libsndfile. Optionally, run pip install flash-attn for faster attention.
  • Model Download: Use huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/ or snapshot_download from the huggingface_hub Python package.
  • Demo: Run bash infer_demo.sh or CUDA_VISIBLE_DEVICES=0 python3 run_gradio.py --model-config model/Audio-Omni.json --ckpt-path model/model.ckpt --server-port 7777. The Gradio demo is accessible at http://localhost:7777.
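
The download and demo steps above can be sketched as a small Python script. This is an illustrative sketch, not part of the repository: the helper names (check_prerequisites, download_weights, demo_command, demo_env) are hypothetical, and it assumes the huggingface_hub package is installed and that the script runs from the repository root, where run_gradio.py lives. The repo id, file names, and port are taken from the steps above. libsndfile is not checked, because detecting it reliably depends on how it was installed.

```python
import os
import shutil
import sys
from pathlib import Path

REPO_ID = "HKUSTAudio/Audio-Omni"  # repo id from the Model Download step
MODEL_DIR = Path("model")          # matches --local-dir model/


def check_prerequisites():
    """Return a mapping of prerequisite name -> whether it looks satisfied."""
    return {
        "python>=3.11": sys.version_info >= (3, 11),
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


def download_weights(local_dir=MODEL_DIR):
    """Fetch the model files into local_dir (requires network access)."""
    # Imported lazily so the other helpers work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir))


def demo_command(model_dir=MODEL_DIR, port=7777):
    """Build the argv list for the Gradio demo shown in the Demo step."""
    return [
        "python3", "run_gradio.py",
        "--model-config", (model_dir / "Audio-Omni.json").as_posix(),
        "--ckpt-path", (model_dir / "model.ckpt").as_posix(),
        "--server-port", str(port),
    ]


def demo_env(gpu="0"):
    """Copy the current environment and pin the demo to one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpu
    return env
```

To use it, call download_weights() once, then pass the results to the standard library: subprocess.run(demo_command(), env=demo_env(), check=True). Building the argument list separately from running it makes the port and GPU index easy to change without editing a shell one-liner.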

Highlighted Details

  • Unified Capabilities: Supports audio understanding, text-to-audio (T2A), text-to-music (T2M), video-to-audio (V2A), video-to-music (V2M), text-to-speech (TTS) with voice cloning, and voice conversion (VC).
  • Advanced Editing: Supports adding, removing, and extracting sounds, as well as style transfer, all controlled via text prompts.
  • Multimodal Input: Processes text, audio, and video inputs for understanding tasks, and uses voice prompts for TTS and VC.
  • Emergent Abilities: Inherits advanced reasoning and manipulation capabilities from the underlying Qwen2.5-Omni MLLM.

Maintenance & Community

  • Contact: Zeyue Tian (ztianad@connect.ust.hk) for inquiries.
  • Acknowledgments: The project acknowledges contributions from AudioX, VidMuse, MMAudio, F5-TTS, and stable-audio-tools.
  • Community: No explicit community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

  • License: The code is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
  • Restrictions: Model weights are strictly for research use. Commercial applications require explicit authorization from the authors.

Limitations & Caveats

Model weights are limited to non-commercial research use. Any commercial use requires prior authorization from the project authors.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 224 stars in the last 30 days
