Ming-UniAudio by inclusionAI

Unified speech LLM for understanding, generation, and editing

Created 1 month ago
298 stars

Top 89.0% on SourcePulse

Project Summary

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing tasks using a single end-to-end model. It introduces a unified continuous speech tokenizer, MingTok-Audio, designed to integrate semantic and acoustic features, enabling a Speech LLM that balances comprehension and synthesis capabilities. The project's primary innovation is its universal, free-form speech editing model, which allows complex semantic and acoustic modifications guided solely by natural language instructions without manual region specification.

How It Works

The core innovation is MingTok-Audio, a VAE-based continuous speech tokenizer employing a causal Transformer architecture. This tokenizer effectively integrates semantic and acoustic speech features, enabling a closed-loop system with Large Language Models through hierarchical representations. A unified Speech LLM is pre-trained using this tokenizer for both understanding and generation, enhanced with a Diffusion Head for high-quality speech synthesis. Building upon this foundation, a dedicated instruction-guided, free-form speech editing framework is developed, supporting comprehensive semantic and acoustic edits without requiring explicit temporal region specification.
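
The toy sketch below is a minimal, hypothetical illustration of this data flow, not the project's actual API: a VAE-style encoder produces continuous latents (the "tokens"), a causal Transformer backbone consumes them, and a decoder maps latents back to acoustic features. The real model's Diffusion Head is elided, and all class and parameter names here are assumptions.

    import torch
    import torch.nn as nn

    class ToyContinuousTokenizer(nn.Module):
        # Hypothetical stand-in for MingTok-Audio: VAE-style encode/decode
        # over continuous latents rather than discrete codebook indices.
        def __init__(self, feat_dim=80, latent_dim=64):
            super().__init__()
            self.enc = nn.Linear(feat_dim, 2 * latent_dim)  # predicts mean and log-variance
            self.dec = nn.Linear(latent_dim, feat_dim)

        def encode(self, mel):  # mel: (batch, time, feat_dim)
            mu, logvar = self.enc(mel).chunk(2, dim=-1)
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick

        def decode(self, z):
            return self.dec(z)

    class ToySpeechLM(nn.Module):
        # Causal Transformer over continuous speech latents; the real model
        # adds a Diffusion Head for high-quality synthesis, elided here.
        def __init__(self, latent_dim=64, n_heads=4, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)

        def forward(self, z):
            mask = nn.Transformer.generate_square_subsequent_mask(z.size(1))
            return self.backbone(z, mask=mask)

    tok, lm = ToyContinuousTokenizer(), ToySpeechLM()
    mel = torch.randn(1, 100, 80)   # stand-in for ~1 s of mel-spectrogram frames
    z = tok.encode(mel)             # continuous "tokens" carrying semantics + acoustics
    recon = tok.decode(lm(z))       # LLM pass, then acoustic reconstruction
    print(recon.shape)              # torch.Size([1, 100, 80])

The point of the closed loop is that a single continuous, hierarchical representation serves understanding, synthesis, and edit-in-place operations alike.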

Quick Start & Requirements

  • Installation:
    • Via pip: pip install -r requirements.txt
    • Via Docker: docker pull yongjielv/ming_uniaudio:v1.0 (recommended) or build from source.
  • Prerequisites:
    • Hardware: Examples tested on NVIDIA H800-80GB/H20-96G with CUDA 12.4.
    • Dependencies: Python packages pinned in requirements.txt; modelscope for model downloads.
  • Model Download: Models available on Hugging Face and ModelScope. Download process can take several hours.
    pip install modelscope
    modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
    
  • Documentation: Project Page: https://xqacmer.github.io/Ming-Unitok-Audio.github.io/; Demo notebook: cookbooks/demo.ipynb (a minimal loading sketch follows below).
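
Once the checkpoint is downloaded, loading it might look like the sketch below. This is a minimal sketch that assumes the checkpoint ships custom modeling code loadable through transformers' trust_remote_code path, which the README does not confirm; cookbooks/demo.ipynb is the authoritative usage reference.

    # Minimal loading sketch; the transformers-compatible entry point is an
    # assumption. See cookbooks/demo.ipynb for the project's actual usage.
    from transformers import AutoModel, AutoProcessor

    model_dir = "inclusionAI/Ming-UniAudio-16B-A3B"  # local_dir from the download step
    model = AutoModel.from_pretrained(model_dir, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)  # assumption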

Highlighted Details

  • Pioneers the first unified continuous speech tokenizer (MingTok-Audio) for both understanding and generation tasks.
  • Introduces the first Speech LLM leveraging a unified continuous tokenizer for comprehensive speech capabilities.
  • Features the first universal free-form speech editing model, enabling natural language-guided edits without manual region selection.
  • Provides the first open-source benchmark for free-form speech editing tasks.
  • MingTok-Audio demonstrates superior reconstruction performance (e.g., PESQ 4.21) compared to other acoustic tokenizers.
  • Ming-UniAudio models achieve highly competitive results across speech understanding, generation, and editing benchmarks against industry-leading models.

Maintenance & Community

The project provides links to its Hugging Face and ModelScope model pages, as well as a project page. No specific community channels (e.g., Discord, Slack) or details on core maintainers are provided in the README.

Licensing & Compatibility

The provided README content does not specify a software license. Without one, suitability for commercial use, closed-source linking, or redistribution cannot be determined.

Limitations & Caveats

The README does not explicitly list limitations or known issues. However, the example usage points to significant hardware requirements (high-end GPUs such as the H800-80GB or H20-96G with CUDA 12.4), and model downloads can take several hours, so infrastructure and setup time may be real adoption barriers.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 21
  • Star History: 181 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

  • Top 0.2% on SourcePulse; 10k stars
  • End-to-end speech processing toolkit for various speech tasks
  • Created 8 years ago; updated 1 day ago