AudioStory by TencentARC

Generate long-form narrative audio using LLMs

Created 1 month ago
271 stars

Top 95.0% on SourcePulse

View on GitHub
Project Summary

Summary

AudioStory tackles the challenge of generating coherent, long-form narrative audio, an area where existing text-to-audio (TTA) models struggle with temporal coherence and compositional reasoning. It introduces a unified framework that integrates Large Language Models (LLMs) with TTA systems, enabling sophisticated applications such as video dubbing, audio continuation, and complex narrative synthesis. This approach benefits users by producing structured audio with consistent emotional tone and scene transitions, significantly improving instruction-following capabilities and audio fidelity compared to prior methods.

How It Works

AudioStory uses a unified understanding-generation framework. An LLM parses complex narrative instructions or audio inputs and decomposes them into temporally ordered sub-events with contextual cues. A key innovation is the decoupled bridging mechanism, which splits LLM-diffuser collaboration into two specialized components: a bridging query for semantic alignment within each event, and a consistency query that preserves coherence across events. Combined with end-to-end training that unifies instruction comprehension and audio generation, this design improves synergy between the components and removes the need for a separately trained, modular pipeline.
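The sketch below illustrates that flow in miniature. It is a conceptual toy, not the project's actual code: every class, function, and query representation here is hypothetical, standing in for the LLM decomposition, the bridging query, and the consistency query.

```python
# Conceptual sketch of AudioStory's decoupled bridging idea. All names are
# hypothetical stand-ins; the real system conditions a diffusion model on
# learned query embeddings, not on strings.
from dataclasses import dataclass


@dataclass
class SubEvent:
    description: str   # contextual cue produced by the LLM decomposition
    duration_s: float  # target length of this audio segment


def decompose(instruction: str) -> list[SubEvent]:
    """Stand-in for the LLM: split a narrative into temporally ordered sub-events."""
    parts = [p.strip() for p in instruction.split(".") if p.strip()]
    return [SubEvent(p, duration_s=5.0) for p in parts]


def generate_story(instruction: str) -> list[str]:
    segments = []
    consistency = "neutral tone, no prior scene"  # cross-event coherence state
    for event in decompose(instruction):
        # Bridging query: intra-event semantic alignment with the (stub) diffuser.
        bridging = f"{event.description} | carry-over: {consistency}"
        segments.append(f"<{event.duration_s:.0f}s audio conditioned on: {bridging}>")
        # Consistency query: update cross-event state so later segments cohere.
        consistency = f"continues from '{event.description}'"
    return segments


if __name__ == "__main__":
    for seg in generate_story("Thunder rolls in. Rain hits the window. A door slams."):
        print(seg)
```

The point of the separation is that per-event semantics and cross-event state are handled by distinct queries, so one does not have to overload a single conditioning signal with both jobs.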

Quick Start & Requirements

  • Installation: Requires Python >= 3.10 (Anaconda recommended) and PyTorch >= 2.1.0. An NVIDIA GPU with CUDA is mandatory. Installation involves cloning the repository, creating a conda environment (conda create -n audiostory python=3.10), activating it (conda activate audiostory), and running bash install_audiostory.sh.
  • Inference: Run python evaluate/inference.py with a model path (e.g., ckpt/audiostory-3B) and parameters such as guidance scale and total duration; a hedged invocation sketch follows this list.
  • Resources: Model checkpoints are available from Hugging Face. Links to demo videos are provided, with a Gradio demo planned for release.
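A minimal Python wrapper around the documented inference entry point might look like the following. The script and checkpoint paths come from the summary above, but the flag names and values are illustrative assumptions, not confirmed against the repository.

```python
# Minimal sketch of invoking AudioStory inference; assumes the audiostory-3B
# checkpoint has already been downloaded from Hugging Face into ckpt/.
import subprocess

subprocess.run(
    [
        "python", "evaluate/inference.py",
        "--model_path", "ckpt/audiostory-3B",  # documented checkpoint path
        "--guidance", "3.5",                   # hypothetical guidance-scale flag
        "--total_duration", "60",              # hypothetical total length in seconds
    ],
    check=True,  # raise CalledProcessError on a non-zero exit status
)
```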

Highlighted Details

  • Features a unified understanding-generation framework integrating LLMs and TTA.
  • Utilizes a novel decoupled bridging mechanism for semantic alignment and coherence.
  • Supports end-to-end training for improved component synergy.
  • Establishes the AudioStory-10K benchmark dataset.
  • Demonstrates capabilities in video dubbing, cross-domain dubbing, and natural sound narrative generation.

Maintenance & Community

The project is developed by researchers from TencentARC and the Institute of Automation, CAS. A Gradio demo, the AudioStory-10K dataset, and the full training code are listed as future releases. Contact is available via email (guoyuxin2021@ia.ac.cn) for questions and collaborations.

Licensing & Compatibility

The project is licensed under the permissive Apache 2.0 License, allowing for broad compatibility, including commercial use.

Limitations & Caveats

The project is actively under development: several core components, including the AudioStory-10K dataset and the full training code, have not yet been released. Inference is supported today, but full reproducibility and training customization depend on those pending releases.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 272 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Audio generation research paper using latent diffusion

Top 0.1% on SourcePulse · 3k stars · Created 2 years ago · Updated 2 months ago