inclusionAI/Ming-UniAudio: Unified speech LLM for understanding, generation, and editing
Top 89.0% on SourcePulse
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing tasks using a single end-to-end model. It introduces a unified continuous speech tokenizer, MingTok-Audio, designed to integrate semantic and acoustic features, enabling a Speech LLM that balances comprehension and synthesis capabilities. The project's primary innovation is its universal, free-form speech editing model, which allows complex semantic and acoustic modifications guided solely by natural language instructions without manual region specification.
How It Works
The core innovation is MingTok-Audio, a VAE-based continuous speech tokenizer employing a causal Transformer architecture. This tokenizer effectively integrates semantic and acoustic speech features, enabling a closed-loop system with Large Language Models through hierarchical representations. A unified Speech LLM is pre-trained using this tokenizer for both understanding and generation, enhanced with a Diffusion Head for high-quality speech synthesis. Building upon this foundation, a dedicated instruction-guided, free-form speech editing framework is developed, supporting comprehensive semantic and acoustic edits without requiring explicit temporal region specification.
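The README describes the tokenizer only at this high level. As a rough illustration of the general shape of such a component, the sketch below is a minimal PyTorch causal-Transformer VAE that maps acoustic frames to continuous per-frame latents and reconstructs them. Every class name, dimension, and design choice here is an assumption made for illustration, not the MingTok-Audio implementation; the Speech LLM and Diffusion Head are only noted in comments.

```python
import torch
import torch.nn as nn

class ToyContinuousSpeechTokenizer(nn.Module):
    """Illustrative sketch (not MingTok-Audio): a causal Transformer encoder
    produces a Gaussian latent per frame (VAE-style), and a small decoder
    reconstructs the acoustic features from those continuous latents."""

    def __init__(self, feat_dim=80, d_model=256, n_layers=4, latent_dim=64):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, d_model), nn.GELU(), nn.Linear(d_model, feat_dim)
        )

    def forward(self, feats):                       # feats: (batch, frames, feat_dim)
        T = feats.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(self.in_proj(feats), mask=causal_mask)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(z)                     # acoustic reconstruction target
        # In the real system, the continuous latents would feed a Speech LLM for
        # understanding/generation, and a diffusion head would refine synthesis.
        return z, recon, mu, logvar

tok = ToyContinuousSpeechTokenizer()
z, recon, mu, logvar = tok(torch.randn(2, 100, 80))   # (batch, frames, mel bins)
print(z.shape, recon.shape)   # torch.Size([2, 100, 64]) torch.Size([2, 100, 80])
```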
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
Docker: docker pull yongjielv/ming_uniaudio:v1.0 (recommended), or build from source
Model weights are downloaded with modelscope: pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
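For scripted setups, the same download can be done from Python. The snippet below mirrors the CLI command above and assumes only that the modelscope package is installed; the cache-location handling is a sketch and may differ from the README's recommended layout.

```python
# Python equivalent of the modelscope CLI download shown above.
# Assumes `pip install modelscope`; model ID and revision come from the README.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "inclusionAI/Ming-UniAudio-16B-A3B",
    revision="master",
)
print("weights cached at:", model_dir)
```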
Project page: https://xqacmer.github.io/Ming-Unitok-Audio.github.io/
Demo notebook: cookbooks/demo.ipynb
Highlighted Details
Maintenance & Community
The project provides links to its Hugging Face and ModelScope model pages, as well as a project page. No specific community channels (e.g., Discord, Slack) or details on core maintainers are provided in the README.
Licensing & Compatibility
The provided README content does not specify a software license, so suitability for commercial use, closed-source linking, or redistribution cannot be determined.
Limitations & Caveats
The README does not explicitly list limitations or known issues. However, the example usage points to significant hardware requirements (high-end GPUs with specific CUDA versions), and model downloads can be time-consuming, so infrastructure and setup time may be barriers to adoption.