VQGAN-CLIP by nerdyrodent

Local VQGAN+CLIP tool for text-to-image generation

created 4 years ago
2,657 stars

Top 18.1% on sourcepulse

View on GitHub
Project Summary

This repository provides a local implementation of VQGAN+CLIP, a generative art model that synthesizes images from text prompts. It targets artists, researchers, and hobbyists seeking to run advanced AI image generation without relying on cloud platforms like Google Colab. The primary benefit is enabling local, customizable control over the VQGAN+CLIP pipeline.

How It Works

The project pairs the VQGAN architecture, which decodes a latent code into an image, with the CLIP model, which embeds both images and text and scores how well they match. Generation starts from random latents or from an encoded initial image, and the latent code is iteratively refined by gradient descent so that CLIP's similarity between the decoded image and the text prompts increases, yielding high-fidelity images guided by natural language.
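
In practice the whole loop is driven from the command line via generate.py. A minimal invocation, along the lines of the basic example in the upstream README, looks like this (run from the repository root with the environment from the Quick Start section active):

    python generate.py -p "A painting of an apple in a fruit bowl"

Each invocation runs the latent-optimisation loop described above and writes the finished image to disk.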

Quick Start & Requirements

  • Install: Create a conda environment (conda create --name vqgan python=3.9, then conda activate vqgan), install PyTorch built for CUDA 11.1 (pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html), and install the remaining dependencies (pip install -r requirements.txt). Clone the required repositories: git clone https://github.com/openai/CLIP and git clone https://github.com/CompVis/taming-transformers. Finally, download VQGAN checkpoints into a checkpoints/ directory. These steps are collected in the sketch after this list.
  • Prerequisites: Python 3.9, Anaconda, NVIDIA GPU with CUDA 11.1. VRAM requirements range from 8GB (380x380) to 24GB (900x900).
  • Setup Time: Estimated setup time is approximately 30-60 minutes, depending on download speeds and dependency installation.
  • Links: VQGAN+CLIP GitHub, CLIP GitHub, Taming Transformers GitHub.
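
For convenience, the install steps above can be collected into a single shell session. The commands are exactly those listed in the Install bullet; the only additions are cloning the project itself (needed for requirements.txt and generate.py) and creating the checkpoints/ directory, since the model download links are maintained in the upstream README:

    # Get the project and create the environment (CUDA 11.1 build of PyTorch 1.9)
    git clone https://github.com/nerdyrodent/VQGAN-CLIP
    cd VQGAN-CLIP
    conda create --name vqgan python=3.9
    conda activate vqgan
    pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 \
        -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt

    # Required companion repositories
    git clone https://github.com/openai/CLIP
    git clone https://github.com/CompVis/taming-transformers

    # VQGAN checkpoints go here; download links are listed in the upstream README
    mkdir -p checkpoints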

Highlighted Details

  • Supports text-to-image generation with weighted and multiple prompts.
  • Enables image-to-image translation and style transfer using an initial image.
  • Includes "Story Mode" for sequential prompt generation and zoom video creation.
  • Offers advanced options for optimizers, learning rates, and image augmentations.
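
As a rough illustration of how these features surface on the command line: the '|' separator and ':weight' suffix for combining and weighting prompts follow the upstream README, while the initial-image flag shown below is recalled from memory rather than verified, so check python generate.py -h before relying on it:

    # Multiple prompts separated by '|', each optionally weighted with ':<weight>'
    python generate.py -p "A painting of an apple in a fruit bowl | psychedelic:0.5"

    # Start the optimisation from an existing image instead of random noise
    # (flag name assumed; confirm with python generate.py -h)
    python generate.py -p "An oil painting of a forest at dusk" -ii my_photo.png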

Maintenance & Community

The project is a personal exploration by "nerdyrodent" and does not indicate a formal maintenance team or community channels like Discord/Slack.

Licensing & Compatibility

The repository itself does not explicitly state a license, so the reuse terms of its own code are unclear. Its main dependencies, CLIP and Taming Transformers, are both MIT-licensed. Pretrained VQGAN checkpoints are distributed separately and carry their own terms, so check the license of each model before commercial use or integration into closed-source projects.

Limitations & Caveats

AMD GPU support is experimental and requires ROCm installation. CPU-only generation is possible but significantly slower. The project is presented as a personal experiment, implying potential for breaking changes or lack of long-term support. CUDA out-of-memory errors are common for larger resolutions or higher cut counts.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
13 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 2 more.

glide-text2im by openai

Text-conditional image synthesis model from research paper. 4k stars (top 0.1%); created 3 years ago, updated 1 year ago.