Generate long-form narrative audio using LLMs
Summary
AudioStory tackles the challenge of generating coherent, long-form narrative audio, an area where existing text-to-audio (TTA) models struggle with temporal coherence and compositional reasoning. It introduces a unified framework that integrates Large Language Models (LLMs) with TTA systems, enabling sophisticated applications such as video dubbing, audio continuation, and complex narrative synthesis. This approach benefits users by producing structured audio with consistent emotional tone and scene transitions, significantly improving instruction-following capabilities and audio fidelity compared to prior methods.
How It Works
AudioStory utilizes a unified understanding-generation framework. LLMs parse complex narrative instructions or audio inputs, decomposing them into temporally ordered sub-events with contextual cues. A key innovation is its decoupled bridging mechanism, which separates LLM-diffuser collaboration into specialized components: a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation. This design, coupled with end-to-end training that unifies instruction comprehension and audio generation, enhances synergy and eliminates the need for modular training pipelines, ultimately achieving strong instruction comprehension and high-quality audio generation.
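The decoupled bridging idea can be sketched in a few lines of toy Python. Everything here is illustrative, not the project's actual API: a stand-in planner splits a narrative instruction into ordered sub-events, and each sub-event carries a bridging query (local semantics for that event) plus a consistency query (shared state preserved across events) that would condition the diffuser.

```python
from dataclasses import dataclass

# Hypothetical sketch of AudioStory's decoupled bridging mechanism.
# Names (SubEvent, plan_narrative, generate_audio) are illustrative only.

@dataclass
class SubEvent:
    index: int
    description: str
    bridging_query: str      # intra-event semantic alignment
    consistency_query: str   # cross-event coherence signal

def plan_narrative(instruction: str) -> list[SubEvent]:
    """Stand-in for the LLM planner: split the instruction into ordered sub-events."""
    parts = [p.strip() for p in instruction.split(", then ")]
    shared = f"shared tone for: {instruction}"  # one coherence signal for all events
    return [
        SubEvent(i, p, bridging_query=f"render: {p}", consistency_query=shared)
        for i, p in enumerate(parts)
    ]

def generate_audio(events: list[SubEvent]) -> list[str]:
    """Stand-in for the TTA diffuser: one clip per event, conditioned on both queries."""
    return [
        f"clip[{e.index}] <- ({e.bridging_query} | {e.consistency_query})"
        for e in events
    ]

events = plan_narrative("rain starts, then thunder rolls, then the storm fades")
clips = generate_audio(events)
```

The point of the split is visible even in the toy version: every event gets its own bridging query, while all events share one consistency query, which is what keeps tone and scene transitions coherent across the generated sequence.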
Quick Start & Requirements
Setup involves creating a conda environment (conda create -n audiostory python=3.10), activating it (conda activate audiostory), and running bash install_audiostory.sh. Inference is run via python evaluate/inference.py with specified model paths (e.g., ckpt/audiostory-3B) and parameters like guidance and total duration.
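The setup steps above can be consolidated into a short shell sequence. The environment name, install script, and checkpoint path come from the listing; the exact inference flags for guidance and duration are not specified here, so check the repository's README before running.

```shell
# Create and activate the project environment, then install dependencies
conda create -n audiostory python=3.10
conda activate audiostory
bash install_audiostory.sh

# Run inference with the released checkpoint; pass the model path
# (e.g., ckpt/audiostory-3B) plus guidance and total-duration options
# as documented in the repository.
python evaluate/inference.py
```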
Maintenance & Community
The project is developed by researchers from TencentARC and the Institute of Automation, CAS. Key components like a Gradio demo, the AudioStory-10k dataset, and full training codes are listed as future releases. Contact is available via email (guoyuxin2021@ia.ac.cn) for questions and collaborations.
Licensing & Compatibility
The project is licensed under the permissive Apache 2.0 License, allowing for broad compatibility, including commercial use.
Limitations & Caveats
The project is actively under development, with several core components, including the full dataset and training codes, yet to be released. While inference is supported, full reproducibility and training customization are pending further releases.