StoryMem by Kevin-thu

AI for multi-shot long video storytelling

Created 3 weeks ago


595 stars

Top 54.7% on SourcePulse

Project Summary

StoryMem addresses the challenge of generating coherent, minute-long, multi-shot narrative videos from text-based story scripts. It targets researchers and developers in AI video generation, offering a solution for creating visually appealing videos with consistent characters and cinematic quality across sequential shots. The primary benefit is enabling automated, high-fidelity storytelling through advanced diffusion models.

How It Works

StoryMem employs a memory-conditioned, shot-by-shot video diffusion approach. It initiates generation using a Text-to-Video (T2V) model to create the first shot, establishing an initial memory state. Subsequent shots are generated using a fine-tuned Memory-to-Video (M2V) LoRA model, which leverages the accumulated memory from previous shots. This iterative process, where memory is updated after each generated shot, ensures high character coherence and visual consistency throughout the narrative, leading to cinematic quality.
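
The loop below is a minimal sketch of this pipeline, assuming only the high-level flow described above; the function names (t2v_generate, m2v_generate, update_memory) are illustrative placeholders, not StoryMem's actual API.

    from typing import Callable, List, Sequence

    def generate_story_video(
        shot_prompts: Sequence[str],
        t2v_generate: Callable,      # text prompt -> frames for the first shot
        m2v_generate: Callable,      # (text prompt, memory) -> frames for a later shot
        update_memory: Callable,     # (memory or None, new shot) -> updated memory
    ) -> List:
        """Shot-by-shot generation with a memory state carried across shots."""
        shots: List = []

        # Shot 1: no memory exists yet, so use the plain text-to-video model.
        first_shot = t2v_generate(shot_prompts[0])
        shots.append(first_shot)
        memory = update_memory(None, first_shot)   # seed memory from shot 1

        # Shots 2..N: condition on the accumulated memory via the M2V LoRA model,
        # then fold each newly generated shot back into the memory.
        for prompt in shot_prompts[1:]:
            shot = m2v_generate(prompt, memory)
            shots.append(shot)
            memory = update_memory(memory, shot)

        return shots  # downstream code would concatenate these into one long video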

Quick Start & Requirements

  • Installation: Clone the repository (git clone --single-branch --branch main git@github.com:Kevin-thu/StoryMem.git), navigate into the directory, create and activate a Conda environment (conda create -n storymem python=3.11, conda activate storymem), and install dependencies (pip install -r requirements.txt, pip install flash_attn).
  • Prerequisites: Python 3.11, Conda. Requires downloading three core models: Wan2.2 T2V, Wan2.2 I2V, and the StoryMem M2V LoRA weights, available via Huggingface.
  • Run: Execute the example script bash run_example.sh.
  • Links: Project Page (mentioned, URL not provided), Huggingface model repositories, ST-Bench dataset (./story subfolder).

Highlighted Details

  • Shot-by-shot video generation using a memory-conditioned diffusion model for long-form storytelling.
  • Supports MI2V (memory + first-frame image conditioning) and MM2V (memory + first 5 motion frames conditioning) for enhanced shot-to-shot coherence.
  • Introduces ST-Bench, a dataset comprising 30 long story scripts and 300 detailed video prompts for evaluating multi-shot video storytelling capabilities.
  • Provides a detailed system prompt template for generating structured, shot-level story scripts suitable for the model (a hypothetical example of such a script is sketched after this list).
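
For a rough sense of what a structured, shot-level script might look like, the example below pairs each shot prompt with one of the conditioning modes mentioned above; the field names are hypothetical and do not reflect StoryMem's actual script schema.

    # Hypothetical shot-level story script; field names are illustrative,
    # not StoryMem's actual schema.
    story_script = [
        {
            "shot_id": 1,
            "prompt": "Wide establishing shot: a lighthouse keeper watches a storm roll in at dusk.",
            "conditioning": "T2V",   # first shot: text-to-video, no memory yet
        },
        {
            "shot_id": 2,
            "prompt": "Close-up: the keeper lights the lamp as rain streaks the window behind him.",
            "conditioning": "MI2V",  # memory + first-frame image conditioning
        },
        {
            "shot_id": 3,
            "prompt": "The beam sweeps over the waves while a small boat fights toward shore.",
            "conditioning": "MM2V",  # memory + first 5 motion frames conditioning
        },
    ]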

Maintenance & Community

The project acknowledges that it is built on Wan2.2. No specific details regarding active maintenance, community channels (e.g., Discord, Slack), or a public roadmap are provided in the README.

Licensing & Compatibility

The README does not specify a software license. This absence presents a significant adoption blocker, as the terms for use, modification, and distribution are undefined, potentially restricting commercial or even academic applications.

Limitations & Caveats

The README does not explicitly list known limitations, bugs, or a pre-release status. However, advanced video diffusion systems typically demand substantial computational resources (high-end GPUs, significant VRAM) and long generation times, so these caveats should be expected here as well. The absence of a defined license remains the most critical caveat for any potential adopter.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 16
  • Star History: 600 stars in the last 26 days
