MLLM for multimodal long story generation
Top 42.5% on sourcepulse
SEED-Story is a multimodal large language model (MLLM) designed for generating long, coherent stories with both text and consistent images. It targets researchers and developers interested in creative AI applications, enabling the creation of up to 25 sequential multimodal story segments from initial text and image prompts.
How It Works
SEED-Story employs a three-stage process. First, a Stable Diffusion XL (SD-XL) based de-tokenizer is pre-trained to reconstruct images from Vision Transformer (ViT) features. Second, the MLLM is trained on interleaved image-text sequences, performing next-word prediction and image feature regression. Finally, the regressed image features are fed back to the de-tokenizer for SD-XL tuning, ensuring character and style consistency across generated images. This approach allows for flexible story generation, adapting to different narrative paths based on initial inputs.
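A minimal sketch of the three stages under stated assumptions: the modules below (detokenizer, mllm_head) and all shapes are illustrative placeholders, not the repo's actual API, and the text-side next-word prediction loss is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins: a 768-d ViT feature vector per image, a de-tokenizer
# head, and the MLLM's image-feature regression head.
detokenizer = nn.Linear(768, 3 * 64 * 64)  # stands in for the SD-XL de-tokenizer
mllm_head = nn.Linear(768, 768)            # stands in for the MLLM's regression head

def stage1_loss(image, vit_feat):
    # Stage 1: pre-train the de-tokenizer to reconstruct the image
    # from ViT features.
    recon = detokenizer(vit_feat)
    return F.mse_loss(recon, image.flatten(1))

def stage2_loss(vit_feat, next_vit_feat):
    # Stage 2: train the MLLM to regress the next image's ViT features
    # (alongside next-word prediction on the interleaved text, omitted here).
    pred = mllm_head(vit_feat)
    return F.mse_loss(pred, next_vit_feat)

def stage3_loss(image, vit_feat):
    # Stage 3: feed the MLLM's *regressed* features back to the de-tokenizer
    # and tune it, so character and style consistency survives the regression.
    with torch.no_grad():
        pred = mllm_head(vit_feat)
    recon = detokenizer(pred)
    return F.mse_loss(recon, image.flatten(1))

# Smoke test with random tensors:
img, feat = torch.randn(2, 3, 64, 64), torch.randn(2, 768)
print(stage1_loss(img, feat).item(), stage3_loss(img, feat).item())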
Quick Start & Requirements
pip install -r requirements.txt
python3 src/inference/gen_george.py
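Conceptually, the inference script runs an interleaved autoregressive loop like the sketch below. generate_step is a hypothetical stub standing in for the real model call, not a function from the repo.

def generate_step(history):
    # Stub in place of the real model call; returns dummy outputs. The real
    # step would emit the next text segment plus regressed image features,
    # decoded into an image by the SD-XL de-tokenizer.
    return f"text segment {len(history) + 1}", None

def generate_story(text_prompt, image, max_segments=25):
    # Each step conditions on the full interleaved history so far, which is
    # what lets the story adapt to different narrative paths.
    segments = [(text_prompt, image)]
    while len(segments) < max_segments:
        segments.append(generate_step(segments))
    return segments

# Illustrative prompt (Curious George is one of the training subsets):
story = generate_story("George visits the zoo.", None)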
Highlighted Details
Maintenance & Community
The project is from Tencent ARC. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Licensed under the Apache License, Version 2.0. Commercial use is generally permitted under Apache 2.0, but users should review the specific terms and any third-party component licenses mentioned in the License file.
Limitations & Caveats
The README indicates that training code is available for instruction tuning, but the primary focus of the release is inference. The model was trained on specific subsets of the StoryStream dataset (Curious George, Rabbids Invasion, The Land Before Time), and performance on other domains may vary.