diffusion-4k by zhang0jhon

Synthesize ultra-high-resolution images with latent diffusion models

Created 1 year ago
355 stars

Top 78.9% on SourcePulse

Project Summary

Diffusion-4K offers a framework for direct ultra-high-resolution image synthesis using latent diffusion models, targeting researchers and practitioners in generative AI. It addresses the lack of high-resolution benchmarks and introduces a wavelet-based fine-tuning method for enhanced detail synthesis, particularly with large-scale models like SD3-2B and Flux-12B.

How It Works

The framework introduces the Aesthetic-4K benchmark, a curated 4K dataset with GPT-4o-generated captions, and novel evaluation metrics (GLCM Score, Compression Ratio) alongside standard ones (FID, Aesthetics, CLIPScore). Its core technical contribution is a wavelet-based fine-tuning approach that enables direct training on photorealistic 4K images, improving detail preservation and synthesis quality in latent diffusion models.
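The paper defines the GLCM Score precisely; as a rough illustration of the kind of texture statistic a gray-level co-occurrence matrix yields, the sketch below computes GLCM contrast over horizontally adjacent pixels. The function name, the quantization to 8 gray levels, and the single (0, 1) offset are assumptions for the sketch, not the paper's formulation; the intuition is that richer fine detail produces more high-contrast co-occurrences.

```python
# Illustrative sketch only: a GLCM-based texture statistic of the kind a
# "GLCM Score" could build on (the paper's exact definition is not given here).
# Builds a gray-level co-occurrence matrix for horizontally adjacent pixels,
# then reports contrast: sum over (i, j) of P(i, j) * (i - j)^2.

def glcm_contrast(image, levels=8):
    """image: 2-D list of ints in [0, levels); offset fixed at (0, 1)."""
    glcm = [[0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):   # count horizontal neighbor pairs
            glcm[a][b] += 1
    total = sum(sum(r) for r in glcm)
    if total == 0:
        return 0.0
    return sum(
        glcm[i][j] / total * (i - j) ** 2
        for i in range(levels)
        for j in range(levels)
    )

flat = [[3, 3, 3, 3]] * 4       # uniform patch: no contrast between neighbors
textured = [[0, 7, 0, 7]] * 4   # alternating patch: strong local contrast

print(glcm_contrast(flat))                 # 0.0
print(round(glcm_contrast(textured), 6))   # 49.0
```

A flat region scores zero while a high-frequency pattern scores high, which matches the README's framing of the metric as a probe for fine-detail richness in 4K outputs.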

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Pre-trained models (SD3-2B, Flux-12B) and the Aesthetic-4K dataset must be downloaded separately. A CUDA-capable GPU is effectively required to run these diffusion models.
  • Links: Aesthetic-4K dataset: huggingface/Aesthetic-4K, SC-VAE training code: sc-vae, Aesthetic-Train-V2: huggingface/Aesthetic-Train-V2.

Highlighted Details

  • Introduces the Aesthetic-4K benchmark and GLCM Score/Compression Ratio metrics for evaluating ultra-high-resolution image synthesis.
  • Proposes a wavelet-based fine-tuning method for direct 4K image training.
  • Demonstrates effectiveness with large models like SD3-2B and Flux-12B.
  • Provides example generation commands for resolutions up to 4096x3072.
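To illustrate the intuition behind the wavelet-based fine-tuning highlighted above (not the repository's actual implementation), a single 2-D Haar wavelet step splits an image into a low-frequency approximation (LL) and three high-frequency detail bands (LH, HL, HH). A detail-aware training objective can then weight the high-frequency bands; the real objective lives in the paper and the linked sc-vae code.

```python
# Minimal sketch of a one-level 2-D Haar wavelet decomposition, shown only
# to convey the idea of separating low- and high-frequency content.

def haar2d(image):
    """One 2-D Haar step on an even-sized 2-D list; returns (LL, LH, HL, HH)."""
    # Row transform: pairwise averages (low-pass) and differences (high-pass).
    rows_lo, rows_hi = [], []
    for row in image:
        rows_lo.append([(a + b) / 2 for a, b in zip(row[::2], row[1::2])])
        rows_hi.append([(a - b) / 2 for a, b in zip(row[::2], row[1::2])])

    def cols(mat):
        # Column transform: same averaging/differencing down pairs of rows.
        lo = [[(a + b) / 2 for a, b in zip(r0, r1)]
              for r0, r1 in zip(mat[::2], mat[1::2])]
        hi = [[(a - b) / 2 for a, b in zip(r0, r1)]
              for r0, r1 in zip(mat[::2], mat[1::2])]
        return lo, hi

    ll, lh = cols(rows_lo)   # approximation, column-direction detail
    hl, hh = cols(rows_hi)   # row-direction detail, diagonal detail
    return ll, lh, hl, hh

# A flat patch puts all its energy in LL; the detail bands are zero.
ll, lh, hl, hh = haar2d([[4, 4], [4, 4]])
print(ll, lh, hl, hh)  # [[4.0]] [[0.0]] [[0.0]] [[0.0]]
```

Fine textures and edges show up almost entirely in the LH/HL/HH bands, which is why emphasizing them during fine-tuning plausibly improves detail synthesis at 4K.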

Maintenance & Community

The project is associated with CVPR 2025 and has an arXiv paper. Links to related datasets and training code are provided. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license for the code or the model checkpoints. It acknowledges dependencies on Diffusers, Transformers, SD3, Flux, and CLIP+MLP Aesthetic Score Predictor, whose licenses would apply.

Limitations & Caveats

The project is research code accompanying a CVPR 2025 paper, so interfaces and checkpoints may change. Explicit licensing information for the core Diffusion-4K components is missing, which could impact commercial use or integration into closed-source projects.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

Sana by NVlabs

Top 0.1% on SourcePulse · 5k stars
Image synthesis research paper using a linear diffusion transformer
Created 1 year ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 12 more.

IF by deep-floyd

Top 0% on SourcePulse · 8k stars
Text-to-image model for photorealistic synthesis and language understanding
Created 3 years ago · Updated 1 year ago
Starred by Robin Huang (Cofounder of Comfy Org), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 17 more.

stablediffusion by Stability-AI

Top 0% on SourcePulse · 42k stars
Latent diffusion model for high-resolution image synthesis
Created 3 years ago · Updated 8 months ago