CogView4  by zai-org

Text-to-image generation system using cascading diffusion

created 10 months ago
1,079 stars

Top 35.8% on sourcepulse

Project Summary

The CogView4 repository hosts a family of text-to-image generation models: CogView4 (6B parameters), CogView3-Plus (3B parameters), and CogView3. It targets researchers and developers working in multimodal AI, and offers high-resolution image generation with native Chinese language support and competitive benchmark results.

How It Works

CogView4 uses a Diffusion Transformer architecture, while CogView3 uses cascading diffusion built on a relay diffusion framework. The models generate images at resolutions up to 2048x2048 and accept both Chinese and English prompts. Prompts are encoded with GLM-4-9B (CogView4) or T5-XXL (CogView3 series).

Quick Start & Requirements

  • Install: pip install diffusers transformers accelerate
  • Prerequisites: PyTorch with CUDA support, Python 3.8+. BF16 precision is recommended for inference.
  • Memory: Minimum 13GB VRAM with CPU offloading and 4-bit text encoder, up to 39GB VRAM without offloading for higher resolutions. 32GB RAM recommended.
  • Links: HuggingFace, ModelScope, Diffusers Example
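
The setup above can be sketched as a single inference function. This is a minimal sketch, not the project's official script: the model ID "THUDM/CogView4-6B", the output path, and the generation settings (steps, guidance scale, resolution) are assumptions chosen for illustration, and `CogView4Pipeline` requires a recent diffusers release. The 4-bit text encoder option mentioned under Memory is not shown.

```python
# Sketch only: MODEL_ID and GEN_DEFAULTS are illustrative assumptions,
# not values taken from this page.
MODEL_ID = "THUDM/CogView4-6B"  # assumed Hugging Face repository name

GEN_DEFAULTS = {
    "width": 1024,               # must be a multiple of 32
    "height": 1024,              # must be a multiple of 32
    "num_inference_steps": 50,   # assumed step count
    "guidance_scale": 3.5,       # assumed guidance strength
}


def generate_image(prompt, out_path="cogview4.png", **overrides):
    """Load the pipeline, generate one image, and save it to out_path."""
    # Imported lazily so this module can be loaded without a GPU stack.
    import torch
    from diffusers import CogView4Pipeline

    # BF16 is the precision recommended for inference.
    pipe = CogView4Pipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # Memory savers: keep submodules on the CPU until they are needed,
    # and decode the VAE output in slices and tiles.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    settings = {**GEN_DEFAULTS, **overrides}
    image = pipe(prompt=prompt, **settings).images[0]
    image.save(out_path)
    return out_path


# Example call (downloads the model weights on first use):
# generate_image("A lighthouse at dusk, watercolor style")
```

Dropping the three memory-saving calls should speed up generation on GPUs with enough VRAM, at the cost of the higher memory footprint listed above.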

Highlighted Details

  • CogView4-6B achieves 85.13 on DPG-Bench Overall and 0.73 on GenEval Overall.
  • Supports resolutions from 512x512 up to 2048x2048, with width and height each divisible by 32.
  • Native Chinese prompt support and generation capabilities.
  • Offers CPU offloading and tiling for reduced GPU memory usage.
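
The resolution constraint in the list above can be checked before a generation call. The helper below is a hypothetical sketch (the function names are not part of any CogView or diffusers API) that encodes only the limits stated here: each side between 512 and 2048 pixels and a multiple of 32. The upstream documentation may impose further limits.

```python
MIN_SIDE = 512   # smallest supported side length, in pixels
MAX_SIDE = 2048  # largest supported side length, in pixels
STEP = 32        # each side must be a multiple of this


def is_valid_resolution(width, height):
    """Return True if both sides are in range and multiples of 32."""
    return all(
        MIN_SIDE <= side <= MAX_SIDE and side % STEP == 0
        for side in (width, height)
    )


def snap_side(side):
    """Round a side length to the nearest multiple of 32, then clamp it."""
    snapped = (side + STEP // 2) // STEP * STEP
    return max(MIN_SIDE, min(MAX_SIDE, snapped))


print(is_valid_resolution(1024, 1024))    # True
print(is_valid_resolution(1000, 1024))    # False: 1000 is not a multiple of 32
print(snap_side(1080), snap_side(1920))   # 1088 1920
```

Snapping to the nearest valid size changes the aspect ratio slightly (1920x1080 becomes 1920x1088), which is usually preferable to a rejected request.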

Maintenance & Community

  • Actively developed by THUDM. Recent updates include diffusers adaptation and the upcoming CogKit fine-tuning toolkit.
  • Community contributions are welcomed, with existing wrappers for ComfyUI.
  • WeChat Community

Licensing & Compatibility

  • Code and CogView3 models are licensed under Apache 2.0.
  • CogView4 model weights are available for research and commercial use, subject to THUDM's terms.

Limitations & Caveats

  • Fine-tuning code is not included in the main repository but is available via CogKit or finetrainers.
  • Prompt optimization using an LLM is strongly recommended for optimal generation quality.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days
