SkyReels-V3 by SkyworkAI

Generate high-quality videos with multimodal AI

Created 1 month ago
258 stars

Top 98.1% on SourcePulse

Project Summary

SkyReels V3 is a state-of-the-art multimodal video generation model built on a unified in-context learning framework. It addresses the need for flexible video creation by supporting multi-subject generation from reference images, audio-guided synthesis, and video-to-video transformations. This empowers users in video production, entertainment, and commerce with advanced generative capabilities.

How It Works

The model employs a unified multimodal in-context learning framework with three key generative capabilities:

  • Multi-subject video generation from reference images, maintaining identity and narrative consistency.
  • Audio-guided video generation.
  • Video-to-video generation.

The "Reference to Video" approach uses a cross-frame pairing strategy and image editing for subject extraction and background completion. Video extension focuses on spatiotemporal consistency and narrative continuation, incorporating intelligent shot switching. Talking avatars are generated via multimodal understanding of voice, image, and emotion, built on a diffusion-Transformer architecture with audio-visual alignment.
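The cross-frame pairing idea can be illustrated with a small sketch. Everything below is a hypothetical simplification, not the SkyReels-V3 API: it only shows the shape of pairing each generated frame with every reference subject so all identities are available as in-context conditioning at every step.

```python
# Hypothetical sketch of cross-frame pairing: each frame is paired with
# all reference subjects as conditioning context. Illustration only;
# the function name and data layout are assumptions, not the repo's API.

def build_conditioning_pairs(reference_subjects, num_frames):
    """Pair every frame with the full set of reference subjects so the
    model can attend to each identity throughout the clip."""
    if not 1 <= len(reference_subjects) <= 4:
        raise ValueError("SkyReels-V3 supports 1-4 reference images")
    return [
        {"frame": f, "references": list(reference_subjects)}
        for f in range(num_frames)
    ]

pairs = build_conditioning_pairs(["character_a", "object_b"], num_frames=3)
print(len(pairs))              # 3
print(pairs[0]["references"])  # ['character_a', 'object_b']
```

The point of the pairing is that identity consistency does not depend on a single anchor frame: every frame sees every reference.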

Quick Start & Requirements

  • Installation: Clone the repository (https://github.com/SkyworkAI/SkyReels-V3), cd SkyReels-V3, then pip install -r requirements.txt.
  • Prerequisites: Python 3.12+, CUDA 12.8+.
  • Models: Downloadable from Hugging Face (https://huggingface.co/Skywork) and ModelScope (https://www.modelscope.cn/models/Skywork).
  • Memory Optimization: Use --low_vram flag for GPUs under 24GB or reduce --resolution (e.g., 540P).
  • API Access: An API platform is available at https://www.apifree.ai/explore.
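The installation steps above can be collected into a single setup sketch. The clone URL and flags are taken from the notes above; the version checks at the end are an added convenience, and model download and inference commands (which vary by model) are not shown.

```shell
# Setup sketch for SkyReels-V3, following the quick-start notes above.
git clone https://github.com/SkyworkAI/SkyReels-V3
cd SkyReels-V3
pip install -r requirements.txt

# Verify the stated prerequisites (Python 3.12+, CUDA 12.8+).
python3 -c 'import sys; assert sys.version_info >= (3, 12), "Python 3.12+ required"'
nvcc --version   # inspect the CUDA toolkit version manually
```

On GPUs with less than 24 GB of VRAM, the notes above suggest passing `--low_vram` or lowering `--resolution` (e.g., to 540P) when running inference.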

Highlighted Details

  • Multimodal Generation: Supports generating videos from 1-4 reference images (characters, objects, scenes), audio prompts, and existing video inputs.
  • Advanced Video Extension: Offers two modes: single-shot (5-30s) for seamless continuation, and shot-switching (e.g., Cut-In, Cut-Out) for cinematic transitions.
  • Lifelike Talking Avatars: Creates avatars from a single image and audio, offering precise lip-sync, multi-style support (real, cartoon, animal), and multi-character scene generation.
  • Performance Claims: Achieves SOTA performance on core video extension metrics and competitive results in reference consistency, visual quality, and audio-visual sync.
  • High-Definition Output: Supports 720P resolution, with specific models offering higher capabilities.
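The input constraints listed above (1-4 reference images, 5-30 second single-shot extension, 540P/720P output) can be captured in a small pre-flight check. This is a hypothetical helper written for illustration, not part of the SkyReels-V3 codebase.

```python
# Hypothetical pre-flight validation built from the constraints listed
# above. Illustration only; not the project's actual interface.

SUPPORTED_RESOLUTIONS = {"540P", "720P"}

def validate_request(num_references, extension_seconds, resolution):
    """Reject requests that fall outside the documented limits."""
    if not 1 <= num_references <= 4:
        raise ValueError("reference images must number 1-4")
    if not 5 <= extension_seconds <= 30:
        raise ValueError("single-shot extension supports 5-30 seconds")
    if resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    return True

print(validate_request(2, 10, "720P"))  # True
```

Failing fast on these limits is cheaper than discovering them mid-generation, since video inference runs are long and VRAM-heavy.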

Maintenance & Community

The project saw releases in early 2026, including API platform integration and the publication of inference code and model weights on Hugging Face and ModelScope. Previous versions and related frameworks remain available, and the acknowledgements credit several other open-source projects. No direct community links (e.g., Discord, Slack) are provided.

Licensing & Compatibility

The repository's README does not specify a software license. This omission requires clarification for any adoption, particularly concerning commercial use or integration into proprietary systems.

Limitations & Caveats

  • Hardware Requirements: High VRAM (24GB+) is recommended for optimal performance, though low-VRAM options are available.
  • Dependency Versions: Requires specific Python (3.12+) and CUDA (12.8+) versions.
  • Unspecified License: The lack of a clear license is a significant adoption blocker.
  • Rapid Development: Frequent recent releases indicate active development, so breaking changes are possible.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 8
  • Star History: 255 stars in the last 30 days
