SEED-Story by TencentARC

MLLM for multimodal long story generation

created 1 year ago
862 stars

Top 42.5% on sourcepulse

Project Summary

SEED-Story is a multimodal large language model (MLLM) for generating long, coherent stories with interleaved text and visually consistent images. It targets researchers and developers interested in creative AI applications, producing up to 25 sequential multimodal story segments from an initial text and image prompt.

How It Works

SEED-Story employs a three-stage process. First, a Stable Diffusion XL (SD-XL) based de-tokenizer is pre-trained to reconstruct images from Vision Transformer (ViT) features. Second, the MLLM is trained on interleaved image-text sequences, performing next-word prediction and image feature regression. Finally, the regressed image features are fed back to the de-tokenizer for SD-XL tuning, ensuring character and style consistency across generated images. This approach allows for flexible story generation, adapting to different narrative paths based on initial inputs.
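
To make the second-stage objective concrete, here is a minimal sketch of the combined next-word prediction and image feature regression loss. Everything here is a toy illustration, not the repository's actual code: StoryMLLM, both heads, and all dimensions are hypothetical stand-ins for the Llama-2-based backbone, ViT features, and SD-XL de-tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions for illustration only; the real model pairs a Llama-2-7B
# backbone with ViT image features and an SD-XL-based de-tokenizer.
VIT_DIM, HIDDEN, VOCAB = 256, 512, 1000

class StoryMLLM(nn.Module):
    """Hypothetical stand-in for the interleaved image-text MLLM."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)      # next-word prediction
        self.img_head = nn.Linear(HIDDEN, VIT_DIM)   # image feature regression

    def forward(self, seq):                          # seq: (batch, time, HIDDEN)
        h = self.backbone(seq)
        return self.lm_head(h), self.img_head(h)

model = StoryMLLM()
seq = torch.randn(2, 16, HIDDEN)                     # embedded image-text sequence
text_targets = torch.randint(0, VOCAB, (2, 16))      # ground-truth next words
vit_targets = torch.randn(2, 16, VIT_DIM)            # ViT features of target images

logits, img_feats = model(seq)
# Stage-2 objective: cross-entropy on words plus regression on ViT features.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), text_targets.reshape(-1)) \
     + F.mse_loss(img_feats, vit_targets)
loss.backward()
# Stage 3 (not shown) feeds the regressed features to the SD-XL de-tokenizer and
# tunes it so rendered images keep characters and style consistent.
```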

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Requires downloading checkpoints for SEED-X, SD-XL, Llama-2-7b-hf, and Qwen-VL-Chat, and extracting Qwen-VL-Chat's visual encoder.
  • Data: Download the StoryStream dataset.
  • Inference: Run python3 src/inference/gen_george.py (see the consolidated commands after this list).
  • Resources: Requires significant disk space for model weights and dataset.
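
The steps above consolidate into a short shell session. The commands come from the list above; the clone URL is an assumption based on the organization name, and the checkpoint and dataset paths expected by the configs are not shown.

```bash
# Clone and install dependencies (clone URL assumed from the org name; verify).
git clone https://github.com/TencentARC/SEED-Story.git
cd SEED-Story
pip install -r requirements.txt

# Place the downloaded checkpoints (SEED-X, SD-XL, Llama-2-7b-hf, Qwen-VL-Chat,
# plus Qwen-VL-Chat's extracted visual encoder) and the StoryStream dataset
# where the configs expect them, then run inference:
python3 src/inference/gen_george.py
```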

Highlighted Details

  • Generates stories with up to 25 multimodal sequences.
  • Achieves high GPT-4-evaluated scores for style (8.61), engagingness (6.27), and coherence (8.24).
  • Released with the large-scale StoryStream dataset for multimodal story generation.
  • Offers inference scripts for story generation and visualization.

Maintenance & Community

The project is developed by Tencent ARC. The README does not provide further community engagement details.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. Commercial use is generally permitted under Apache 2.0, but users should review the specific terms and any third-party component licenses referenced in the LICENSE file.

Limitations & Caveats

The README indicates that training code is available for instruction tuning, but the primary focus of the release is inference. The model was trained on specific subsets of the StoryStream dataset (Curious George, Rabbids Invasion, The Land Before Time), and performance on other domains may vary.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 90 days
