UniAudio2 by yangdongchao

Audio foundation model unifies speech, sound, and music processing

Created 2 months ago

274 stars

Top 94.2% on SourcePulse

Project Summary

A unified audio foundation model, UniAudio 2.0 addresses speech, sound, and music tasks with a single architecture. It targets researchers and developers seeking a versatile tool for audio generation and understanding, offering strong performance across diverse audio modalities and few-shot/zero-shot scenarios.

How It Works

UniAudio 2.0 employs a novel ReasoningCodec, which utilizes discrete reasoning and reconstruction tokens. This codec is integrated into a unified autoregressive model trained on a massive dataset of 100B text and 60B audio tokens. The multi-stage training and multi-task data approach allows the model to achieve robust performance on in-domain tasks and excel in few-shot or zero-shot learning across speech, sound, and music generation and recognition.
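To make the idea of one autoregressive model over text and audio concrete, here is a minimal sketch. It is not the project's implementation. It assumes a shared vocabulary in which audio codec ids are offset past the text ids, and it places reasoning tokens before reconstruction tokens. The vocabulary sizes, that ordering, and every name used (TEXT_VOCAB, build_sequence, token_kind) are hypothetical and chosen only for illustration.

```python
# Hypothetical sizes, chosen only for illustration (not taken from UniAudio 2.0).
TEXT_VOCAB = 1000       # ids 0..999 are text tokens
REASONING_VOCAB = 64    # discrete reasoning codes
RECON_VOCAB = 256       # discrete reconstruction codes

# Audio codes are shifted so that all three token types share one id space.
REASONING_OFFSET = TEXT_VOCAB
RECON_OFFSET = TEXT_VOCAB + REASONING_VOCAB
TOTAL_VOCAB = RECON_OFFSET + RECON_VOCAB


def build_sequence(text_ids, reasoning_codes, recon_codes):
    """Flatten text and audio codes into one id stream over the shared vocabulary."""
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < REASONING_VOCAB for c in reasoning_codes):
        raise ValueError("reasoning code out of range")
    if any(not 0 <= c < RECON_VOCAB for c in recon_codes):
        raise ValueError("reconstruction code out of range")
    return (
        list(text_ids)
        + [REASONING_OFFSET + c for c in reasoning_codes]
        + [RECON_OFFSET + c for c in recon_codes]
    )


def token_kind(token_id):
    """Map a shared-vocabulary id back to the token type it encodes."""
    if not 0 <= token_id < TOTAL_VOCAB:
        raise ValueError("token id out of range")
    if token_id < REASONING_OFFSET:
        return "text"
    if token_id < RECON_OFFSET:
        return "reasoning"
    return "reconstruction"


seq = build_sequence([5, 17], [3, 0], [200, 1])
kinds = [token_kind(t) for t in seq]
```

A next-token predictor trained on such flattened streams sees every task as sequence continuation, which is the general reason a single architecture can cover both generation (text in, audio out) and recognition (audio in, text out).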

Quick Start & Requirements

Installation: Clone the repository, create a conda environment with Python 3.10, then run pip install -e . from the repository root.
Prerequisites: Python 3.10 and conda. Checkpoints must be downloaded from Hugging Face (e.g., ReasoningCodec.checkpoint, llm_ep2.checkpoint or llm_ep3.checkpoint), and configuration files (e.g., codec_infer_config.yaml) must be updated with the paths to the downloaded models.
Links: Demo 🎶 | 📑 Paper | Checkpoints 🤗
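Because inference depends on checkpoint files being present at the configured paths, a small pre-flight check can catch a missing download early. The sketch below is not part of the repository. The file names come from the prerequisites above, but the single-directory layout, the reading that either LLM checkpoint is sufficient, and the function name find_missing_checkpoints are assumptions.

```python
from pathlib import Path

# File names as listed in the prerequisites; keeping them in one directory is an assumption.
CODEC_CHECKPOINT = "ReasoningCodec.checkpoint"
LLM_CHECKPOINTS = ("llm_ep2.checkpoint", "llm_ep3.checkpoint")


def find_missing_checkpoints(ckpt_dir):
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(ckpt_dir)
    problems = []
    if not (root / CODEC_CHECKPOINT).is_file():
        problems.append(f"missing {CODEC_CHECKPOINT}")
    # Assumption: one of the two LLM checkpoints is enough to run inference.
    if not any((root / name).is_file() for name in LLM_CHECKPOINTS):
        problems.append("missing " + " or ".join(LLM_CHECKPOINTS))
    return problems
```

Calling find_missing_checkpoints("checkpoints") before editing the configuration files returns an empty list when the expected files are in place, and a short description of each missing file otherwise.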

Highlighted Details

Supports a wide range of tasks including TTS (EN/ZH/Yue), ASR, Text-to-Sound, Audio Captioning, Text-to-Music, and Music Recognition.
Features a unified autoregressive model over text and audio.
Achieves strong in-domain and few-shot/zero-shot performance.
Utilizes a multi-stage training process and multi-task data.

Maintenance & Community

The README excerpt lists no community channels (e.g., Discord, Slack) and no roadmap.

Licensing & Compatibility

The project is released under the MIT License, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The model's instruction understanding capabilities may be limited, as it was not trained on explicit text instruction data. Users may need to adjust prompts to achieve desired performance for instruction-following tasks.

Explore Similar Projects

awesome-audio-plaza by metame-ai

unified-audio by alibaba

UniAudio by yangdongchao

WavJourney by Audio-AGI

VITA-Audio by VITA-MLLM

soundstorm-pytorch by lucidrains

tango by declare-lab

FunMusic by FunAudioLLM

audiolm-pytorch by lucidrains

Kimi-Audio by MoonshotAI

higgs-audio by boson-ai

audiocraft by facebookresearch