SEED-Story by TencentARC

MLLM for multimodal long story generation

created 1 year ago
862 stars

Top 42.5% on sourcepulse

Project Summary

SEED-Story is a multimodal large language model (MLLM) for generating long, coherent stories with interleaved text and visually consistent images. It targets researchers and developers interested in creative AI applications, producing up to 25 sequential multimodal story segments from an initial text and image prompt.

How It Works

SEED-Story employs a three-stage process. First, a Stable Diffusion XL (SD-XL) based de-tokenizer is pre-trained to reconstruct images from Vision Transformer (ViT) features. Second, the MLLM is trained on interleaved image-text sequences, performing next-word prediction and image feature regression. Finally, the regressed image features are fed back to the de-tokenizer for SD-XL tuning, ensuring character and style consistency across generated images. This approach allows for flexible story generation, adapting to different narrative paths based on initial inputs.
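
To make the second-stage objective concrete, here is a minimal sketch of the combined next-word prediction and image feature regression loss. Everything here is a toy illustration, not the repository's actual code: StoryMLLM, both heads, and all dimensions are hypothetical stand-ins for the Llama-2-based backbone, ViT features, and SD-XL de-tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions for illustration only; the real model pairs a Llama-2-7B
# backbone with ViT image features and an SD-XL-based de-tokenizer.
VIT_DIM, HIDDEN, VOCAB = 256, 512, 1000

class StoryMLLM(nn.Module):
    """Hypothetical stand-in for the interleaved image-text MLLM."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)      # next-word prediction
        self.img_head = nn.Linear(HIDDEN, VIT_DIM)   # image feature regression

    def forward(self, seq):                          # seq: (batch, time, HIDDEN)
        h = self.backbone(seq)
        return self.lm_head(h), self.img_head(h)

model = StoryMLLM()
seq = torch.randn(2, 16, HIDDEN)                     # embedded image-text sequence
text_targets = torch.randint(0, VOCAB, (2, 16))      # ground-truth next words
vit_targets = torch.randn(2, 16, VIT_DIM)            # ViT features of target images

logits, img_feats = model(seq)
# Stage-2 objective: cross-entropy on words plus regression on ViT features.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), text_targets.reshape(-1)) \
     + F.mse_loss(img_feats, vit_targets)
loss.backward()
# Stage 3 (not shown) feeds the regressed features to the SD-XL de-tokenizer and
# tunes it so rendered images keep characters and style consistent.
```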

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Requires downloading checkpoints for SEED-X, SD-XL, Llama-2-7b-hf, and Qwen-VL-Chat, and extracting Qwen-VL-Chat's visual encoder.
  • Data: Download the StoryStream dataset.
  • Inference: Run python3 src/inference/gen_george.py (see the consolidated commands after this list).
  • Resources: Requires significant disk space for model weights and dataset.
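
The steps above consolidate into a short shell session. The commands come from the list above; the clone URL is an assumption based on the organization name, and the checkpoint and dataset paths expected by the configs are not shown.

```bash
# Clone and install dependencies (clone URL assumed from the org name; verify).
git clone https://github.com/TencentARC/SEED-Story.git
cd SEED-Story
pip install -r requirements.txt

# Place the downloaded checkpoints (SEED-X, SD-XL, Llama-2-7b-hf, Qwen-VL-Chat,
# plus Qwen-VL-Chat's extracted visual encoder) and the StoryStream dataset
# where the configs expect them, then run inference:
python3 src/inference/gen_george.py
```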

Highlighted Details

  • Generates stories with up to 25 multimodal sequences.
  • Achieves high GPT-4-evaluated scores for style (8.61), engagingness (6.27), and coherence (8.24).
  • Released with the large-scale StoryStream dataset for multimodal story generation.
  • Offers inference scripts for story generation and visualization.

Maintenance & Community

The project is developed by Tencent ARC. The README does not provide further community engagement details.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. Commercial use is generally permitted under Apache 2.0, but users should review the specific terms and any third-party component licenses referenced in the LICENSE file.

Limitations & Caveats

The README indicates that training code is available for instruction tuning, but the primary focus of the release is inference. The model was trained on specific subsets of the StoryStream dataset (Curious George, Rabbids Invasion, The Land Before Time), and performance on other domains may vary.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 90 days
