VQGAN-CLIP by nerdyrodent

Local VQGAN+CLIP tool for text-to-image generation

created 4 years ago
2,657 stars

Top 18.1% on sourcepulse

View on GitHub
Project Summary

This repository provides a local implementation of VQGAN+CLIP, a generative art model that synthesizes images from text prompts. It targets artists, researchers, and hobbyists seeking to run advanced AI image generation without relying on cloud platforms like Google Colab. The primary benefit is enabling local, customizable control over the VQGAN+CLIP pipeline.

How It Works

The project pairs the VQGAN architecture, which decodes a latent code into an image, with the CLIP model, which embeds both images and text and scores how well they match. Generation starts from random latents or from an encoded initial image, and the latent code is iteratively refined by gradient descent so that CLIP's similarity between the decoded image and the text prompts increases, yielding high-fidelity images guided by natural language.
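
In practice the whole loop is driven from the command line via generate.py. A minimal invocation, along the lines of the basic example in the upstream README, looks like this (run from the repository root with the environment from the Quick Start section active):

    python generate.py -p "A painting of an apple in a fruit bowl"

Each invocation runs the latent-optimisation loop described above and writes the finished image to disk.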

Quick Start & Requirements

  • Install: Create a conda environment (conda create --name vqgan python=3.9, then conda activate vqgan), install PyTorch built for CUDA 11.1 (pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html), and install the remaining dependencies (pip install -r requirements.txt). Clone the required repositories: git clone https://github.com/openai/CLIP and git clone https://github.com/CompVis/taming-transformers. Finally, download VQGAN checkpoints into a checkpoints/ directory. These steps are collected in the sketch after this list.
  • Prerequisites: Python 3.9, Anaconda, NVIDIA GPU with CUDA 11.1. VRAM requirements range from 8GB (380x380) to 24GB (900x900).
  • Setup Time: Estimated setup time is approximately 30-60 minutes, depending on download speeds and dependency installation.
  • Links: VQGAN+CLIP GitHub, CLIP GitHub, Taming Transformers GitHub.
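
For convenience, the install steps above can be collected into a single shell session. The commands are exactly those listed in the Install bullet; the only additions are cloning the project itself (needed for requirements.txt and generate.py) and creating the checkpoints/ directory, since the model download links are maintained in the upstream README:

    # Get the project and create the environment (CUDA 11.1 build of PyTorch 1.9)
    git clone https://github.com/nerdyrodent/VQGAN-CLIP
    cd VQGAN-CLIP
    conda create --name vqgan python=3.9
    conda activate vqgan
    pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 \
        -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt

    # Required companion repositories
    git clone https://github.com/openai/CLIP
    git clone https://github.com/CompVis/taming-transformers

    # VQGAN checkpoints go here; download links are listed in the upstream README
    mkdir -p checkpoints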

Highlighted Details

  • Supports text-to-image generation with weighted and multiple prompts.
  • Enables image-to-image translation and style transfer using an initial image.
  • Includes "Story Mode" for sequential prompt generation and zoom video creation.
  • Offers advanced options for optimizers, learning rates, and image augmentations.
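
As a rough illustration of how these features surface on the command line: the '|' separator and ':weight' suffix for combining and weighting prompts follow the upstream README, while the initial-image flag shown below is recalled from memory rather than verified, so check python generate.py -h before relying on it:

    # Multiple prompts separated by '|', each optionally weighted with ':<weight>'
    python generate.py -p "A painting of an apple in a fruit bowl | psychedelic:0.5"

    # Start the optimisation from an existing image instead of random noise
    # (flag name assumed; confirm with python generate.py -h)
    python generate.py -p "An oil painting of a forest at dusk" -ii my_photo.png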

Maintenance & Community

The project is a personal exploration by "nerdyrodent" and does not indicate a formal maintenance team or community channels like Discord/Slack.

Licensing & Compatibility

The repository itself does not explicitly state a license, so the reuse terms of its own code are unclear. Its main dependencies, CLIP and Taming Transformers, are both MIT-licensed. Pretrained VQGAN checkpoints are distributed separately and carry their own terms, so check the license of each model before commercial use or integration into closed-source projects.

Limitations & Caveats

AMD GPU support is experimental and requires ROCm installation. CPU-only generation is possible but significantly slower. The project is presented as a personal experiment, implying potential for breaking changes or lack of long-term support. CUDA out-of-memory errors are common for larger resolutions or higher cut counts.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
13 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 2 more.

glide-text2im by openai

Text-conditional image synthesis model from research paper. 4k stars (top 0.1%); created 3 years ago, updated 1 year ago.