Ming-UniAudio by inclusionAI

Unified speech LLM for understanding, generation, and editing

Created 1 month ago
298 stars

Top 89.0% on SourcePulse

Project Summary

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing tasks using a single end-to-end model. It introduces a unified continuous speech tokenizer, MingTok-Audio, designed to integrate semantic and acoustic features, enabling a Speech LLM that balances comprehension and synthesis capabilities. The project's primary innovation is its universal, free-form speech editing model, which allows complex semantic and acoustic modifications guided solely by natural language instructions without manual region specification.

How It Works

The core innovation is MingTok-Audio, a VAE-based continuous speech tokenizer employing a causal Transformer architecture. This tokenizer effectively integrates semantic and acoustic speech features, enabling a closed-loop system with Large Language Models through hierarchical representations. A unified Speech LLM is pre-trained using this tokenizer for both understanding and generation, enhanced with a Diffusion Head for high-quality speech synthesis. Building upon this foundation, a dedicated instruction-guided, free-form speech editing framework is developed, supporting comprehensive semantic and acoustic edits without requiring explicit temporal region specification.
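
The toy sketch below is a minimal, hypothetical illustration of this data flow, not the project's actual API: a VAE-style encoder produces continuous latents (the "tokens"), a causal Transformer backbone consumes them, and a decoder maps latents back to acoustic features. The real model's Diffusion Head is elided, and all class and parameter names here are assumptions.

    import torch
    import torch.nn as nn

    class ToyContinuousTokenizer(nn.Module):
        # Hypothetical stand-in for MingTok-Audio: VAE-style encode/decode
        # over continuous latents rather than discrete codebook indices.
        def __init__(self, feat_dim=80, latent_dim=64):
            super().__init__()
            self.enc = nn.Linear(feat_dim, 2 * latent_dim)  # predicts mean and log-variance
            self.dec = nn.Linear(latent_dim, feat_dim)

        def encode(self, mel):  # mel: (batch, time, feat_dim)
            mu, logvar = self.enc(mel).chunk(2, dim=-1)
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick

        def decode(self, z):
            return self.dec(z)

    class ToySpeechLM(nn.Module):
        # Causal Transformer over continuous speech latents; the real model
        # adds a Diffusion Head for high-quality synthesis, elided here.
        def __init__(self, latent_dim=64, n_heads=4, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)

        def forward(self, z):
            mask = nn.Transformer.generate_square_subsequent_mask(z.size(1))
            return self.backbone(z, mask=mask)

    tok, lm = ToyContinuousTokenizer(), ToySpeechLM()
    mel = torch.randn(1, 100, 80)   # stand-in for ~1 s of mel-spectrogram frames
    z = tok.encode(mel)             # continuous "tokens" carrying semantics + acoustics
    recon = tok.decode(lm(z))       # LLM pass, then acoustic reconstruction
    print(recon.shape)              # torch.Size([1, 100, 80])

The point of the closed loop is that a single continuous, hierarchical representation serves understanding, synthesis, and edit-in-place operations alike.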

Quick Start & Requirements

  • Installation:
    • Via pip: pip install -r requirements.txt
    • Via Docker: docker pull yongjielv/ming_uniaudio:v1.0 (recommended) or build from source.
  • Prerequisites:
    • Hardware: Examples tested on NVIDIA H800-80GB/H20-96G with CUDA 12.4.
    • Dependencies: Python packages pinned in requirements.txt; modelscope for model downloads.
  • Model Download: Models available on Hugging Face and ModelScope. Download process can take several hours.
    pip install modelscope
    modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
    
  • Documentation: Project Page: https://xqacmer.github.io/Ming-Unitok-Audio.github.io/; Demo notebook: cookbooks/demo.ipynb (a minimal loading sketch follows below).
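
Once the checkpoint is downloaded, loading it might look like the sketch below. This is a minimal sketch that assumes the checkpoint ships custom modeling code loadable through transformers' trust_remote_code path, which the README does not confirm; cookbooks/demo.ipynb is the authoritative usage reference.

    # Minimal loading sketch; the transformers-compatible entry point is an
    # assumption. See cookbooks/demo.ipynb for the project's actual usage.
    from transformers import AutoModel, AutoProcessor

    model_dir = "inclusionAI/Ming-UniAudio-16B-A3B"  # local_dir from the download step
    model = AutoModel.from_pretrained(model_dir, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)  # assumption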

Highlighted Details

  • Pioneers the first unified continuous speech tokenizer (MingTok-Audio) for both understanding and generation tasks.
  • Introduces the first Speech LLM leveraging a unified continuous tokenizer for comprehensive speech capabilities.
  • Features the first universal free-form speech editing model, enabling natural language-guided edits without manual region selection.
  • Provides the first open-source benchmark for free-form speech editing tasks.
  • MingTok-Audio demonstrates superior reconstruction performance (e.g., PESQ 4.21) compared to other acoustic tokenizers.
  • Ming-UniAudio models achieve highly competitive results across speech understanding, generation, and editing benchmarks against industry-leading models.

Maintenance & Community

The project provides links to its Hugging Face and ModelScope model pages, as well as a project page. No specific community channels (e.g., Discord, Slack) or details on core maintainers are provided in the README.

Licensing & Compatibility

The provided README content does not specify a software license. Without one, suitability for commercial use, closed-source linking, or redistribution cannot be determined.

Limitations & Caveats

The README does not explicitly list limitations or known issues. However, the example usage points to significant hardware requirements (high-end GPUs such as the H800-80GB or H20-96G with CUDA 12.4), and model downloads can take several hours, so infrastructure and setup time may be real adoption barriers.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 21
  • Star History: 181 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

  • Top 0.2% on SourcePulse; 10k stars
  • End-to-end speech processing toolkit for various speech tasks
  • Created 8 years ago; updated 1 day ago