CogVideo by zai-org

Text-to-video generation models (CogVideoX, CogVideo)

Created 3 years ago

12,302 stars

Top 4.1% on SourcePulse

7 Experts Love This Project

Omar Sanseviero: DevRel at Google DeepMind
Abubakar Abid: Cofounder of Gradio
Chip Huyen: Author of "AI Engineering" and "Designing Machine Learning Systems"
Dror Weiss: Cofounder of Tabnine
and 3 more

Project Summary

CogVideoX is an open-source framework for text-to-video and image-to-video generation, building upon the original CogVideo model. It offers multiple model sizes (2B and 5B parameters) and versions, including CogVideoX1.5, with varying capabilities in video resolution, length, and inference precision. The project targets researchers and developers looking to leverage advanced video generation technology.

How It Works

CogVideoX generates video with Transformer-based diffusion models. Two inference implementations are supported: SAT (SwissArmyTransformer) and Hugging Face Diffusers. Because the models were trained on long prompts, the project recommends expanding short prompts with an LLM such as GLM-4 before generation. INT8 and FP8 quantization is available through TorchAO and Optimum-quanto to reduce the memory footprint and allow inference on lower-end GPUs.
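As a rough illustration of the Diffusers path, the sketch below loads a checkpoint and renders one clip. It is a minimal example under stated assumptions, not the project's reference script: the hub ID `THUDM/CogVideoX-2b`, the sampling values (50 steps, guidance scale 6.0, 49 frames, 8 fps) and the helper names `default_dtype_name` and `generate` are illustrative choices, so check the model card before relying on them. The heavy imports live inside `generate` so the file can be imported without a GPU.

```python
def default_dtype_name(model_id: str) -> str:
    """Pick the torch dtype name this sketch assumes for a checkpoint.

    Assumption: 2B checkpoints run in FP16 and 5B checkpoints in BF16,
    matching the Diffusers precisions listed on this page.
    """
    return "float16" if "2b" in model_id.lower() else "bfloat16"


def generate(prompt: str,
             model_id: str = "THUDM/CogVideoX-2b",  # assumed hub ID
             out_path: str = "output.mp4") -> str:
    """Render one text-to-video clip and return the output path."""
    # Imported lazily: requires torch, diffusers and accelerate.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    dtype = getattr(torch, default_dtype_name(model_id))
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=dtype)

    # Memory-saving options; they trade speed for a smaller footprint.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    # Sampling values below are illustrative, not tuned recommendations.
    frames = pipe(
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
        num_frames=49,
    ).frames[0]

    export_to_video(frames, out_path, fps=8)
    return out_path
```

Calling `generate("A detailed, paragraph-length description of the scene ...")` downloads the weights on first use and writes `output.mp4`. A long, descriptive prompt is used here because the models were trained on long prompts.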

Quick Start & Requirements

Installation: pip install -r requirements.txt (for Diffusers) or follow SAT instructions.
Python Version: 3.10 to 3.12.
Hardware:
- CogVideoX-2B: SAT BF16 needs 18GB; Diffusers FP16 needs at least 4GB, which fits older GPUs such as the GTX 1080 Ti.
- CogVideoX-5B: SAT BF16 needs 26GB; Diffusers BF16 needs at least 5GB, which fits desktop GPUs such as the RTX 3060.
- Higher-end GPUs (A100, H100) are recommended for faster inference and larger models.
- FP8 precision requires an NVIDIA H100 or newer GPU and specific PyTorch/TorchAO installations. CUDA 12.4 is recommended.
Resources: Colab notebooks are provided for quick testing.
Links: Huggingface Space, ModelScope Space, CogKit, Technical Report.
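The requirements above can be turned into a quick pre-flight check. The sketch below only encodes the figures quoted in this section (Python 3.10 to 3.12 and the minimum memory per model and backend); the names `MIN_VRAM_GB`, `python_supported` and `runnable_configs` are hypothetical helpers for illustration and are not part of the CogVideo repository.

```python
import sys

# Minimum GPU memory in GB, copied from the hardware list above.
MIN_VRAM_GB = {
    ("CogVideoX-2B", "sat"): 18,
    ("CogVideoX-2B", "diffusers"): 4,
    ("CogVideoX-5B", "sat"): 26,
    ("CogVideoX-5B", "diffusers"): 5,
}


def python_supported(version=None):
    """Return True if the interpreter is within the 3.10 to 3.12 range."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 10) <= (major, minor) <= (3, 12)


def runnable_configs(vram_gb):
    """List the (model, backend) pairs whose minimum fits in vram_gb."""
    return sorted(key for key, need in MIN_VRAM_GB.items() if vram_gb >= need)


# Example: an 11GB card only clears the Diffusers minimums.
print(runnable_configs(11))
```

Meeting a minimum only means the model should load with memory optimizations enabled; it says nothing about generation speed.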

Highlighted Details

Supports text-to-video, image-to-video, and video-to-video generation.
CogVideoX1.5-5B supports 10-second videos at 1360x768 resolution.
CogVideoX1.5-5B-I2V supports arbitrary resolutions for image-to-video.
The cogvideox-factory fine-tuning framework supports fine-tuning on a single RTX 4090 GPU.
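To get a feel for what "10 seconds at 1360x768" means in frames and raw pixels, here is a small back-of-the-envelope calculation. It assumes a frame rate of 16 fps and a "seconds × fps + 1" frame rule for CogVideoX1.5 (and 8 fps for the earlier CogVideoX models); these figures are not stated on this page and should be verified against the upstream README.

```python
def frame_count(seconds: int, fps: int) -> int:
    """Frames for a clip under the assumed 'seconds * fps + 1' rule."""
    return seconds * fps + 1


def raw_rgb_bytes(frames: int, width: int, height: int) -> int:
    """Size of the uncompressed 8-bit RGB frames, in bytes."""
    return frames * width * height * 3


# CogVideoX1.5-5B: 10 seconds at 1360x768, assuming 16 fps.
frames_v15 = frame_count(10, 16)
print(frames_v15)                            # 161
print(raw_rgb_bytes(frames_v15, 1360, 768))  # 504483840 (about 0.5 GB)
```

The raw size is only the decoded output; memory use during inference depends on the latent representation and the model weights, not on this number.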

Maintenance & Community

The project is actively updated with new models and features (e.g., CogVideoX1.5, DDIM Inverse, CogKit). Community contributions are welcomed, with several community-adapted projects listed. Links to Discord and WeChat are available.

Licensing & Compatibility

Code: Apache 2.0 License.
CogVideoX-2B Model: Apache 2.0 License.
CogVideoX-5B Model: CogVideoX LICENSE (specific terms not detailed in README). Commercial use may require careful review of the CogVideoX LICENSE.

Limitations & Caveats

The CogVideoX-5B model is released under the CogVideoX LICENSE rather than Apache 2.0, which may affect commercial use. INT8 quantization reduces memory use but significantly slows inference. FP8 precision is restricted to H100 or newer GPUs. Output quality depends heavily on the prompt, and optimizing prompts requires an LLM.

Health Check

Last Commit: 2 months ago
Responsiveness: 1 day

Star History

95 stars in the last 30 days