SkyReels-V3 by SkyworkAI

Generate high-quality videos with multimodal AI

Created 1 month ago
258 stars

Top 98.1% on SourcePulse

Project Summary

SkyReels V3 is a state-of-the-art multimodal video generation model built on a unified in-context learning framework. It addresses the need for flexible video creation by supporting multi-subject generation from reference images, audio-guided synthesis, and video-to-video transformations. This empowers users in video production, entertainment, and commerce with advanced generative capabilities.

How It Works

The model employs a unified multimodal in-context learning framework with three key generative capabilities:

  • Multi-subject video generation from reference images, maintaining identity and narrative consistency.
  • Audio-guided video generation.
  • Video-to-video generation.

The "Reference to Video" approach uses a cross-frame pairing strategy and image editing for subject extraction and background completion. Video extension focuses on spatiotemporal consistency and narrative continuation, incorporating intelligent shot switching. Talking avatars are generated via multimodal understanding of voice, image, and emotion, built on a diffusion-Transformer architecture with audio-visual alignment.
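The cross-frame pairing idea can be illustrated with a small sketch. Everything below is a hypothetical simplification, not the SkyReels-V3 API: it only shows the shape of pairing each generated frame with every reference subject so all identities are available as in-context conditioning at every step.

```python
# Hypothetical sketch of cross-frame pairing: each frame is paired with
# all reference subjects as conditioning context. Illustration only;
# the function name and data layout are assumptions, not the repo's API.

def build_conditioning_pairs(reference_subjects, num_frames):
    """Pair every frame with the full set of reference subjects so the
    model can attend to each identity throughout the clip."""
    if not 1 <= len(reference_subjects) <= 4:
        raise ValueError("SkyReels-V3 supports 1-4 reference images")
    return [
        {"frame": f, "references": list(reference_subjects)}
        for f in range(num_frames)
    ]

pairs = build_conditioning_pairs(["character_a", "object_b"], num_frames=3)
print(len(pairs))              # 3
print(pairs[0]["references"])  # ['character_a', 'object_b']
```

The point of the pairing is that identity consistency does not depend on a single anchor frame: every frame sees every reference.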

Quick Start & Requirements

  • Installation: Clone the repository (https://github.com/SkyworkAI/SkyReels-V3), cd SkyReels-V3, then pip install -r requirements.txt.
  • Prerequisites: Python 3.12+, CUDA 12.8+.
  • Models: Downloadable from Hugging Face (https://huggingface.co/Skywork) and ModelScope (https://www.modelscope.cn/models/Skywork).
  • Memory Optimization: Use --low_vram flag for GPUs under 24GB or reduce --resolution (e.g., 540P).
  • API Access: An API platform is available at https://www.apifree.ai/explore.
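The installation steps above can be collected into a single setup sketch. The clone URL and flags are taken from the notes above; the version checks at the end are an added convenience, and model download and inference commands (which vary by model) are not shown.

```shell
# Setup sketch for SkyReels-V3, following the quick-start notes above.
git clone https://github.com/SkyworkAI/SkyReels-V3
cd SkyReels-V3
pip install -r requirements.txt

# Verify the stated prerequisites (Python 3.12+, CUDA 12.8+).
python3 -c 'import sys; assert sys.version_info >= (3, 12), "Python 3.12+ required"'
nvcc --version   # inspect the CUDA toolkit version manually
```

On GPUs with less than 24 GB of VRAM, the notes above suggest passing `--low_vram` or lowering `--resolution` (e.g., to 540P) when running inference.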

Highlighted Details

  • Multimodal Generation: Supports generating videos from 1-4 reference images (characters, objects, scenes), audio prompts, and existing video inputs.
  • Advanced Video Extension: Offers two modes: single-shot (5-30s) for seamless continuation, and shot-switching (e.g., Cut-In, Cut-Out) for cinematic transitions.
  • Lifelike Talking Avatars: Creates avatars from a single image and audio, offering precise lip-sync, multi-style support (real, cartoon, animal), and multi-character scene generation.
  • Performance Claims: Achieves SOTA performance on core video extension metrics and competitive results in reference consistency, visual quality, and audio-visual sync.
  • High-Definition Output: Supports 720P resolution, with specific models offering higher capabilities.
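The input constraints listed above (1-4 reference images, 5-30 second single-shot extension, 540P/720P output) can be captured in a small pre-flight check. This is a hypothetical helper written for illustration, not part of the SkyReels-V3 codebase.

```python
# Hypothetical pre-flight validation built from the constraints listed
# above. Illustration only; not the project's actual interface.

SUPPORTED_RESOLUTIONS = {"540P", "720P"}

def validate_request(num_references, extension_seconds, resolution):
    """Reject requests that fall outside the documented limits."""
    if not 1 <= num_references <= 4:
        raise ValueError("reference images must number 1-4")
    if not 5 <= extension_seconds <= 30:
        raise ValueError("single-shot extension supports 5-30 seconds")
    if resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    return True

print(validate_request(2, 10, "720P"))  # True
```

Failing fast on these limits is cheaper than discovering them mid-generation, since video inference runs are long and VRAM-heavy.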

Maintenance & Community

The project saw releases in early 2026, including API platform integration and the publication of inference code and model weights on Hugging Face and ModelScope. Previous versions and related frameworks remain available, and the acknowledgements credit several other open-source projects. No direct community links (e.g., Discord, Slack) are provided.

Licensing & Compatibility

The repository's README does not specify a software license. This omission requires clarification for any adoption, particularly concerning commercial use or integration into proprietary systems.

Limitations & Caveats

  • Hardware Requirements: High VRAM (24GB+) is recommended for optimal performance, though low-VRAM options are available.
  • Dependency Versions: Requires specific Python (3.12+) and CUDA (12.8+) versions.
  • Unspecified License: The lack of a clear license is a significant adoption blocker.
  • Rapid Development: Frequent recent releases indicate active development, so breaking changes are possible.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 8
  • Star History: 255 stars in the last 30 days
