CogView4 is a suite of text-to-image generation models from THUDM, comprising CogView4 (6B parameters), CogView3-Plus (3B parameters), and CogView3, aimed at researchers and developers working in multimodal AI. The suite offers high-resolution image generation with native Chinese language support and competitive performance on standard benchmarks.
How It Works
CogView4 uses a Diffusion Transformer (DiT) architecture, while CogView3 employs a cascading relay-diffusion framework. Both support Chinese and English prompts and generate flexibly at resolutions up to 2048x2048. For prompt understanding, CogView4 relies on a GLM-4-9B text encoder, whereas the earlier CogView3 models use T5-XXL.
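The Diffusers integration exposes these components directly. A minimal inspection sketch, assuming the `THUDM/CogView4-6B` Hugging Face checkpoint and a diffusers release that ships `CogView4Pipeline` (the printed class names are whatever the installed version defines):

```python
import torch
from diffusers import CogView4Pipeline

# Load the DiT-based CogView4 pipeline in BF16.
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)

# The pipeline components mirror the architecture described above.
print(type(pipe.transformer).__name__)   # Diffusion Transformer backbone
print(type(pipe.text_encoder).__name__)  # GLM-based prompt encoder
print(type(pipe.vae).__name__)           # VAE mapping latents to pixels
```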
Quick Start & Requirements
- Install: `pip install diffusers transformers accelerate`
- Prerequisites: PyTorch with CUDA support, Python 3.8+. BF16 precision is recommended for inference (see the minimal sketch after this list).
- Memory: Minimum 13GB VRAM with CPU offloading and 4-bit text encoder, up to 39GB VRAM without offloading for higher resolutions. 32GB RAM recommended.
- Links: HuggingFace, ModelScope, Diffusers Example
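A minimal end-to-end generation sketch, assuming the `THUDM/CogView4-6B` checkpoint, a CUDA GPU, and a diffusers release that includes `CogView4Pipeline` (the sampler settings here are illustrative, not tuned values):

```python
import torch
from diffusers import CogView4Pipeline

# BF16 is the recommended inference precision.
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    prompt="A red panda reading a book under a maple tree, watercolor style",
    width=1024,
    height=1024,
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("cogview4_sample.png")
```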
Highlighted Details
- CogView4-6B achieves 85.13 on DPG-Bench Overall and 0.73 on GenEval Overall.
- Supports resolutions from 512x512 up to 2048x2048, with width and height each divisible by 32.
- Native Chinese prompt support and generation capabilities.
- Offers CPU offloading and VAE tiling for reduced GPU memory usage; a sketch follows this list.
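A memory-saving sketch using the generic diffusers offloading and VAE tiling hooks (the same `THUDM/CogView4-6B` checkpoint is assumed; exact VRAM savings depend on resolution and hardware):

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)

# Stream submodules between CPU and GPU instead of keeping all weights
# resident; no explicit pipe.to("cuda") is needed with offloading enabled.
pipe.enable_model_cpu_offload()

# Decode latents in slices/tiles to cap peak VAE memory at high resolutions.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = pipe(
    prompt="A panoramic view of terraced rice fields at dawn",
    width=2048,   # width and height must each be divisible by 32
    height=2048,  # supported range: 512x512 up to 2048x2048
).images[0]
image.save("cogview4_2048.png")
```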
Maintenance & Community
- Actively developed by THUDM. Recent updates include the Diffusers integration and the upcoming CogKit fine-tuning toolkit.
- Community contributions are welcomed, with existing wrappers for ComfyUI.
- WeChat Community
Licensing & Compatibility
- Code and CogView3 models are licensed under Apache 2.0.
- CogView4 model weights are available for research and commercial use, subject to THUDM's terms.
Limitations & Caveats
- Fine-tuning code is not included in the main repository but is available via CogKit or finetrainers.
- Prompt optimization using an LLM is strongly recommended for best generation quality; a hypothetical sketch follows.
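A hypothetical prompt-rewriting sketch against any OpenAI-compatible endpoint; the endpoint URL, model name, and system prompt below are illustrative assumptions, not the repository's own script:

```python
from openai import OpenAI

# Illustrative setup: point at any OpenAI-compatible server
# (e.g. a locally hosted GLM-4). Neither URL nor model name is prescribed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM = (
    "You rewrite short image prompts into rich, concrete descriptions "
    "covering subject, setting, lighting, composition, and style. "
    "Reply with the rewritten prompt only."
)

def optimize_prompt(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="glm-4",  # substitute whatever the endpoint actually serves
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

print(optimize_prompt("a cat in a spacesuit"))
```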