UniAudio2 by yangdongchao

Audio foundation model unifies speech, sound, and music processing

Created 3 weeks ago

385 stars

Top 74.5% on SourcePulse

Project Summary

A unified audio foundation model, UniAudio 2.0 addresses speech, sound, and music tasks with a single architecture. It targets researchers and developers seeking a versatile tool for audio generation and understanding, offering strong performance across diverse audio modalities and few-shot/zero-shot scenarios.

How It Works

UniAudio 2.0 is built on ReasoningCodec, a codec that represents audio with two kinds of discrete tokens: reasoning tokens and reconstruction tokens. The codec feeds a single autoregressive model trained on 100B text tokens and 60B audio tokens. Multi-stage training on multi-task data lets the model perform robustly on in-domain tasks and generalize in few-shot and zero-shot settings across speech, sound, and music generation and recognition.
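
To make the idea of one autoregressive model over text and audio tokens concrete, here is a minimal sketch of a shared id space in which audio token ids are shifted past the text vocabulary. This is a common pattern for mixed-modality sequences, not a description of UniAudio 2.0's actual tokenization: the vocabulary sizes, function names, and the offset scheme itself are all assumptions made for illustration.

```python
# Hypothetical sizes, chosen only for illustration (not taken from UniAudio 2.0).
TEXT_VOCAB_SIZE = 32_000
AUDIO_CODEBOOK_SIZE = 1_024


def to_unified_ids(text_ids, audio_ids):
    """Build one sequence: text ids unchanged, audio ids shifted past the text range."""
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for a in audio_ids:
        if not 0 <= a < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"audio id out of range: {a}")
    return list(text_ids) + [TEXT_VOCAB_SIZE + a for a in audio_ids]


def split_unified_ids(ids):
    """Invert to_unified_ids: recover the text ids and the original audio ids."""
    text = [i for i in ids if i < TEXT_VOCAB_SIZE]
    audio = [i - TEXT_VOCAB_SIZE for i in ids if i >= TEXT_VOCAB_SIZE]
    return text, audio


if __name__ == "__main__":
    sequence = to_unified_ids([5, 7], [0, 3])
    print(sequence)
    print(split_unified_ids(sequence))
```

With a layout like this, a single next-token predictor can emit either modality, and the id range tells the decoder which tokens to pass to the codec.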

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with Python 3.10, and run pip install -e . from the repository root.
  • Prerequisites: Python 3.10 and conda. Download the checkpoints from HuggingFace (e.g., ReasoningCodec.checkpoint and either llm_ep2.checkpoint or llm_ep3.checkpoint), then update the configuration files (e.g., codec_infer_config.yaml) with the downloaded model paths.
  • Links: Demo 🎶 | 📑 Paper | Checkpoints 🤗
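
Before editing the configuration files, it can help to confirm that the downloaded checkpoints are where you expect them. The sketch below is a hypothetical helper, not part of the repository: it uses only the checkpoint file names mentioned above, assumes they all sit in one directory, and prefers llm_ep3.checkpoint over llm_ep2.checkpoint purely as an illustrative choice. It does not touch the YAML files, since their key names are not given here.

```python
from pathlib import Path

# File names as listed in the prerequisites; a single flat directory is an assumption.
CODEC_NAME = "ReasoningCodec.checkpoint"
# Preference order (ep3 before ep2) is our choice, not a project recommendation.
LLM_CANDIDATES = ("llm_ep3.checkpoint", "llm_ep2.checkpoint")


def locate_checkpoints(ckpt_dir):
    """Return absolute paths to the codec and one LLM checkpoint found in ckpt_dir."""
    root = Path(ckpt_dir).expanduser().resolve()
    codec = root / CODEC_NAME
    if not codec.is_file():
        raise FileNotFoundError(f"missing codec checkpoint: {codec}")
    for name in LLM_CANDIDATES:
        llm = root / name
        if llm.is_file():
            return {"codec": str(codec), "llm": str(llm)}
    raise FileNotFoundError(
        f"no LLM checkpoint ({', '.join(LLM_CANDIDATES)}) found in {root}"
    )


if __name__ == "__main__":
    import sys

    # Usage: python locate_checkpoints.py /path/to/checkpoints
    for role, path in locate_checkpoints(sys.argv[1]).items():
        print(f"{role}: {path}")
```

The printed absolute paths are the values to copy into the configuration files such as codec_infer_config.yaml.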

Highlighted Details

  • Supports a wide range of tasks including TTS (EN/ZH/Yue), ASR, Text-to-Sound, Audio Captioning, Text-to-Music, and Music Recognition.
  • Features a unified autoregressive model over text and audio.
  • Achieves strong in-domain and few-shot/zero-shot performance.
  • Utilizes a multi-stage training process and multi-task data.

Maintenance & Community

No specific community links (e.g., Discord, Slack) or roadmap details were provided in the README snippet.

Licensing & Compatibility

The project is released under the MIT License, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The model's instruction understanding capabilities may be limited, as it was not trained on explicit text instruction data. Users may need to adjust prompts to achieve desired performance for instruction-following tasks.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 387 stars in the last 22 days
