AudioStory by TencentARC

Generate long-form narrative audio using LLMs

Created 1 month ago
271 stars

Top 95.0% on SourcePulse

View on GitHub
Project Summary

Summary

AudioStory tackles the challenge of generating coherent, long-form narrative audio, an area where existing text-to-audio (TTA) models struggle with temporal coherence and compositional reasoning. It introduces a unified framework that integrates Large Language Models (LLMs) with TTA systems, enabling sophisticated applications such as video dubbing, audio continuation, and complex narrative synthesis. This approach benefits users by producing structured audio with consistent emotional tone and scene transitions, significantly improving instruction-following capabilities and audio fidelity compared to prior methods.

How It Works

AudioStory uses a unified understanding-generation framework. An LLM parses complex narrative instructions or audio inputs and decomposes them into temporally ordered sub-events with contextual cues. A key innovation is the decoupled bridging mechanism, which splits LLM-diffuser collaboration into two specialized components: a bridging query for semantic alignment within each event, and a consistency query that preserves coherence across events. Combined with end-to-end training that unifies instruction comprehension and audio generation, this design improves synergy between the components and removes the need for a separately trained, modular pipeline.
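The sketch below illustrates that flow in miniature. It is a conceptual toy, not the project's actual code: every class, function, and query representation here is hypothetical, standing in for the LLM decomposition, the bridging query, and the consistency query.

```python
# Conceptual sketch of AudioStory's decoupled bridging idea. All names are
# hypothetical stand-ins; the real system conditions a diffusion model on
# learned query embeddings, not on strings.
from dataclasses import dataclass


@dataclass
class SubEvent:
    description: str   # contextual cue produced by the LLM decomposition
    duration_s: float  # target length of this audio segment


def decompose(instruction: str) -> list[SubEvent]:
    """Stand-in for the LLM: split a narrative into temporally ordered sub-events."""
    parts = [p.strip() for p in instruction.split(".") if p.strip()]
    return [SubEvent(p, duration_s=5.0) for p in parts]


def generate_story(instruction: str) -> list[str]:
    segments = []
    consistency = "neutral tone, no prior scene"  # cross-event coherence state
    for event in decompose(instruction):
        # Bridging query: intra-event semantic alignment with the (stub) diffuser.
        bridging = f"{event.description} | carry-over: {consistency}"
        segments.append(f"<{event.duration_s:.0f}s audio conditioned on: {bridging}>")
        # Consistency query: update cross-event state so later segments cohere.
        consistency = f"continues from '{event.description}'"
    return segments


if __name__ == "__main__":
    for seg in generate_story("Thunder rolls in. Rain hits the window. A door slams."):
        print(seg)
```

The point of the separation is that per-event semantics and cross-event state are handled by distinct queries, so one does not have to overload a single conditioning signal with both jobs.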

Quick Start & Requirements

  • Installation: Requires Python >= 3.10 (Anaconda recommended) and PyTorch >= 2.1.0. An NVIDIA GPU with CUDA is mandatory. Installation involves cloning the repository, creating a conda environment (conda create -n audiostory python=3.10), activating it (conda activate audiostory), and running bash install_audiostory.sh.
  • Inference: Run python evaluate/inference.py with a model path (e.g., ckpt/audiostory-3B) and parameters such as guidance scale and total duration; a hedged invocation sketch follows this list.
  • Resources: Model checkpoints are available from Hugging Face. Links to demo videos are provided, with a Gradio demo planned for release.
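A minimal Python wrapper around the documented inference entry point might look like the following. The script and checkpoint paths come from the summary above, but the flag names and values are illustrative assumptions, not confirmed against the repository.

```python
# Minimal sketch of invoking AudioStory inference; assumes the audiostory-3B
# checkpoint has already been downloaded from Hugging Face into ckpt/.
import subprocess

subprocess.run(
    [
        "python", "evaluate/inference.py",
        "--model_path", "ckpt/audiostory-3B",  # documented checkpoint path
        "--guidance", "3.5",                   # hypothetical guidance-scale flag
        "--total_duration", "60",              # hypothetical total length in seconds
    ],
    check=True,  # raise CalledProcessError on a non-zero exit status
)
```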

Highlighted Details

  • Features a unified understanding-generation framework integrating LLMs and TTA.
  • Utilizes a novel decoupled bridging mechanism for semantic alignment and coherence.
  • Supports end-to-end training for improved component synergy.
  • Establishes the AudioStory-10K benchmark dataset.
  • Demonstrates capabilities in video dubbing, cross-domain dubbing, and natural sound narrative generation.

Maintenance & Community

The project is developed by researchers from TencentARC and the Institute of Automation, CAS. A Gradio demo, the AudioStory-10K dataset, and the full training code are listed as future releases. Contact is available via email (guoyuxin2021@ia.ac.cn) for questions and collaborations.

Licensing & Compatibility

The project is licensed under the permissive Apache 2.0 License, allowing for broad compatibility, including commercial use.

Limitations & Caveats

The project is actively under development: several core components, including the AudioStory-10K dataset and the full training code, have not yet been released. Inference is supported today, but full reproducibility and training customization depend on those pending releases.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 272 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Audio generation research paper using latent diffusion

Top 0.1% on SourcePulse · 3k stars · Created 2 years ago · Updated 2 months ago