diffusion-4k by zhang0jhon

Synthesize ultra-high-resolution images with latent diffusion models

Created 6 months ago
295 stars

Top 89.8% on SourcePulse

View on GitHub
Project Summary

Diffusion-4K offers a framework for direct ultra-high-resolution image synthesis using latent diffusion models, targeting researchers and practitioners in generative AI. It addresses the lack of high-resolution benchmarks and introduces a wavelet-based fine-tuning method for enhanced detail synthesis, particularly with large-scale models like SD3-2B and Flux-12B.

How It Works

The framework introduces the Aesthetic-4K benchmark, a curated 4K dataset with GPT-4o-generated captions, and novel evaluation metrics (GLCM Score, Compression Ratio) alongside standard ones (FID, Aesthetics, CLIPScore). Its core technical contribution is a wavelet-based fine-tuning approach that enables direct training on photorealistic 4K images, improving detail preservation and synthesis quality in latent diffusion models.
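
The repository's exact training objective is not reproduced in this summary, but the intuition behind wavelet-based fine-tuning can be sketched: decompose images (or latents) into low- and high-frequency subbands and give the high-frequency detail extra weight in the reconstruction loss. The snippet below is an illustrative PyTorch sketch under that assumption; `haar_dwt2`, `wavelet_weighted_loss`, and `hf_weight` are hypothetical names, not the project's API.

```python
import torch
import torch.nn.functional as F

def haar_dwt2(x: torch.Tensor):
    """Single-level 2D Haar DWT for a (B, C, H, W) tensor with even H and W.
    Returns the (LL, LH, HL, HH) subbands, each of shape (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def wavelet_weighted_loss(pred, target, hf_weight=2.0):
    """Hypothetical reconstruction loss that up-weights the detail subbands,
    illustrating the spirit of wavelet-based fine-tuning (not the repo's code)."""
    ll_p, lh_p, hl_p, hh_p = haar_dwt2(pred)
    ll_t, lh_t, hl_t, hh_t = haar_dwt2(target)
    low = F.l1_loss(ll_p, ll_t)
    high = F.l1_loss(lh_p, lh_t) + F.l1_loss(hl_p, hl_t) + F.l1_loss(hh_p, hh_t)
    return low + hf_weight * high
```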

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Pre-trained base models (SD3-2B, Flux-12B) and the Aesthetic-4K dataset must be downloaded separately; a CUDA-capable GPU is effectively required for diffusion models at this scale (see the illustrative sketch after this list).
  • Links: Aesthetic-4K dataset: huggingface/Aesthetic-4K, SC-VAE training code: sc-vae, Aesthetic-Train-V2: huggingface/Aesthetic-Train-V2.
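
The repository ships its own inference scripts and fine-tuned checkpoints, which are not reproduced in this summary; the sketch below only shows the general shape of a high-resolution generation call using the stock `diffusers` `StableDiffusion3Pipeline`. The model id, prompt, and resolution are placeholders, and generating at 4096x3072 directly assumes the project's fine-tuned weights and a GPU with very large memory.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Placeholder checkpoint: swap in the Diffusion-4K fine-tuned weights released
# with the project; the stock SD3 model is listed only for illustration.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic alpine lake at dawn, ultra-detailed",
    height=3072,               # 4096x3072 is the largest resolution cited in the README
    width=4096,
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sample_4k.png")
```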

Highlighted Details

  • Introduces the Aesthetic-4K benchmark and GLCM Score/Compression Ratio metrics for evaluating ultra-high-resolution image synthesis (see the metric sketch after this list).
  • Proposes a wavelet-based fine-tuning method for direct 4K image training.
  • Demonstrates effectiveness with large models like SD3-2B and Flux-12B.
  • Provides example generation commands for resolutions up to 4096x3072.
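
The paper's exact definitions of the GLCM Score and Compression Ratio are not given in this summary; the sketch below shows one plausible way such detail metrics could be computed, using `scikit-image` gray-level co-occurrence statistics and a JPEG-size proxy. The function names and the quantization/quality settings are illustrative assumptions, not the project's implementation.

```python
import io
import numpy as np
from PIL import Image
from skimage.feature import graycomatrix, graycoprops

def glcm_texture_stats(path, levels=64):
    """Hypothetical texture-richness statistics from a gray-level
    co-occurrence matrix (the paper's GLCM Score may be defined differently)."""
    img = np.array(Image.open(path).convert("L"))
    img = (img // (256 // levels)).astype(np.uint8)   # quantize to `levels` gray levels
    glcm = graycomatrix(img, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    # Higher contrast and lower homogeneity loosely indicate finer local detail.
    return {
        "contrast": graycoprops(glcm, "contrast").mean(),
        "homogeneity": graycoprops(glcm, "homogeneity").mean(),
    }

def compression_ratio(path, quality=90):
    """Hypothetical proxy: raw pixel bytes vs. JPEG-encoded bytes.
    Detail-rich images compress less, yielding a lower ratio."""
    img = Image.open(path).convert("RGB")
    raw_bytes = img.width * img.height * 3
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return raw_bytes / buf.tell()
```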

Maintenance & Community

The project accompanies a CVPR 2025 paper with an arXiv preprint. Links to related datasets and training code are provided. No community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license for the code or the model checkpoints. It acknowledges dependencies on Diffusers, Transformers, SD3, Flux, and CLIP+MLP Aesthetic Score Predictor, whose licenses would apply.

Limitations & Caveats

The project is research code accompanying a CVPR 2025 paper, so interfaces, checkpoints, and training recipes may change. Explicit licensing information for the core Diffusion-4K components is missing, which could complicate commercial use or integration into closed-source projects.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

27 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

Sana by NVlabs

0.4%
4k stars
Image synthesis research paper using a linear diffusion transformer
Created 11 months ago
Updated 5 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 11 more.

IF by deep-floyd

0.0%
8k stars
Text-to-image model for photorealistic synthesis and language understanding
Created 2 years ago
Updated 1 year ago
Starred by Robin Huang (Cofounder of Comfy Org), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 17 more.

stablediffusion by Stability-AI

0.1%
42k stars
Latent diffusion model for high-resolution image synthesis
Created 2 years ago
Updated 2 months ago