VQGAN-CLIP by nerdyrodent

Local VQGAN+CLIP tool for text-to-image generation

Created 4 years ago
2,660 stars

Top 17.7% on SourcePulse

Project Summary

This repository provides a local implementation of VQGAN+CLIP, a generative art model that synthesizes images from text prompts. It targets artists, researchers, and hobbyists seeking to run advanced AI image generation without relying on cloud platforms like Google Colab. The primary benefit is enabling local, customizable control over the VQGAN+CLIP pipeline.

How It Works

The project pairs VQGAN, which decodes a latent code into an image, with CLIP, which scores how well that image matches a text description. Starting from noise or an initial image, it iteratively adjusts the latent so that the decoded image moves closer to the meaning of the provided text prompts. The result is image synthesis steered by natural language.
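
The refinement loop described above can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: a fixed 3x3 matrix stands in for "VQGAN decode followed by CLIP image encoding", a fixed vector stands in for the CLIP text embedding, and finite differences replace the automatic differentiation a real implementation would use. It is not the repository's code; it only shows the shape of the optimization.

```python
import math

# Toy stand-ins, NOT the real models: a fixed linear map plays the role of
# "VQGAN decode + CLIP image encoder", and a fixed vector plays the role of
# the CLIP text embedding of the prompt.
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 1.0]]
TEXT_EMBEDDING = [1.0, 0.0, 0.0]


def embed(latent):
    """Map a latent vector to a toy 'image embedding' (matrix-vector product)."""
    return [sum(w * z for w, z in zip(row, latent)) for row in W]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def loss(latent):
    """Lower is better: 1 minus the image/text embedding similarity."""
    return 1.0 - cosine_similarity(embed(latent), TEXT_EMBEDDING)


def numerical_gradient(f, point, eps=1e-5):
    """Central finite differences; real code would use autograd instead."""
    grad = []
    for i in range(len(point)):
        up = point[:i] + [point[i] + eps] + point[i + 1:]
        down = point[:i] + [point[i] - eps] + point[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimize(latent, steps=200, learning_rate=0.1):
    """Plain gradient descent on the latent, mirroring the iterative refinement."""
    latent = list(latent)
    for _ in range(steps):
        grad = numerical_gradient(loss, latent)
        latent = [z - learning_rate * g for z, g in zip(latent, grad)]
    return latent


start = [0.1, -0.2, 0.3]  # arbitrary starting latent, standing in for noise
final = optimize(start)
initial_similarity = cosine_similarity(embed(start), TEXT_EMBEDDING)
final_similarity = cosine_similarity(embed(final), TEXT_EMBEDDING)
print(f"similarity before: {initial_similarity:.3f}, after: {final_similarity:.3f}")
```

The only thing being optimized is the latent; the stand-in "models" stay fixed, which matches the description of guiding generation rather than training a network.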

Quick Start & Requirements

  • Install:
      1. Create and activate a conda environment: conda create --name vqgan python=3.9, then conda activate vqgan
      2. Install PyTorch with CUDA 11.1: pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
      3. Install the other dependencies: pip install -r requirements.txt
      4. Clone the required repositories: git clone https://github.com/openai/CLIP and git clone https://github.com/CompVis/taming-transformers
      5. Download the VQGAN checkpoints to a checkpoints/ directory.
  • Prerequisites: Python 3.9, Anaconda, and an NVIDIA GPU with CUDA 11.1. VRAM requirements range from 8 GB (380x380 output) to 24 GB (900x900 output).
  • Setup Time: Approximately 30-60 minutes, depending on download speeds and dependency installation.
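  • Links: VQGAN-CLIP (https://github.com/nerdyrodent/VQGAN-CLIP), CLIP (https://github.com/openai/CLIP), Taming Transformers (https://github.com/CompVis/taming-transformers).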
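
Because the setup has several manual steps, a small preflight check can catch a missed clone or download before a run. The sketch below is an assumption-laden helper, not part of the repository: the function name and message format are hypothetical, and it assumes the two cloned repositories and the checkpoints/ directory all sit directly under the project root.

```python
import sys
from pathlib import Path

# Directory names follow the install steps; the 3.9 check mirrors the stated
# prerequisite. The function name and return format are hypothetical.
REQUIRED_DIRS = ("CLIP", "taming-transformers", "checkpoints")


def missing_setup_items(project_root, python_version=None):
    """Return a list of problems found; an empty list means every check passed."""
    root = Path(project_root)
    version = tuple(python_version or sys.version_info[:2])
    problems = []
    if version != (3, 9):
        problems.append(f"Python 3.9 expected, found {version[0]}.{version[1]}")
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    return problems


if __name__ == "__main__":
    # Check the current working directory and report anything that is missing.
    for problem in missing_setup_items("."):
        print("WARNING:", problem)
```

Run it from the project root; no output means the directories and Python version match what the setup steps describe. It does not verify the GPU, CUDA, or the contents of the checkpoints.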

Highlighted Details

  • Supports text-to-image generation with weighted and multiple prompts.
  • Enables image-to-image translation and style transfer using an initial image.
  • Includes "Story Mode" for sequential prompt generation and zoom video creation.
  • Offers advanced options for optimizers, learning rates, and image augmentations.
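
To make "weighted and multiple prompts" concrete, here is a minimal parsing sketch. It assumes prompts are separated by a pipe character and that an optional numeric weight follows a trailing colon (for example "an apple | surreal:0.5"); this summary does not spell out the syntax, so treat both the separators and the function name as assumptions rather than the tool's documented behavior.

```python
def parse_prompts(spec, default_weight=1.0):
    """Split 'a | b:0.5' into [('a', 1.0), ('b', 0.5)].

    Assumed syntax: '|' separates prompts, and a trailing ':<number>' sets
    the weight of that prompt. Prompts without a weight get `default_weight`.
    """
    prompts = []
    for chunk in spec.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue  # ignore empty segments such as a trailing '|'
        text, weight = chunk, default_weight
        head, sep, tail = chunk.rpartition(":")
        if sep:
            try:
                weight = float(tail)
                text = head.strip()
            except ValueError:
                pass  # the colon belongs to the prompt text, not a weight
        prompts.append((text, weight))
    return prompts


if __name__ == "__main__":
    for text, weight in parse_prompts("an apple in a bowl | surreal:0.5"):
        print(f"{weight:>4} x {text}")
```

In a weighted setup, each prompt's similarity loss would typically be scaled by its weight before the losses are summed, so a weight of 0.5 pulls the image toward that prompt half as strongly.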

Maintenance & Community

The project is a personal exploration by "nerdyrodent". The repository lists no formal maintenance team and no community channels such as Discord or Slack.

Licensing & Compatibility

The repository itself does not explicitly state a license. It depends on CLIP (MIT License) and Taming Transformers (MIT License). VQGAN checkpoints are typically released under permissive licenses that allow commercial use and integration into closed-source projects, but terms vary by checkpoint.

Limitations & Caveats

AMD GPU support is experimental and requires ROCm installation. CPU-only generation is possible but significantly slower. The project is presented as a personal experiment, implying potential for breaking changes or lack of long-term support. CUDA out-of-memory errors are common for larger resolutions or higher cut counts.
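
One way to reduce out-of-memory errors is to pick an output size from the available VRAM before starting. The sketch below is a rough heuristic only: it takes the two data points listed under Prerequisites (8 GB at 380x380, 24 GB at 900x900) and assumes memory scales linearly with pixel count, which is an assumption rather than a measurement. Real usage also depends on the cut count, the checkpoint, and other settings, and the step of 8 pixels is an arbitrary choice.

```python
# The two data points come from the Prerequisites line; linear scaling with
# pixel count between (and beyond) them is an ASSUMPTION, not a measurement.
KNOWN_POINTS = ((380 * 380, 8.0), (900 * 900, 24.0))  # (pixels, GB)


def estimate_vram_gb(width, height):
    """Linearly interpolate (or extrapolate) VRAM use from the two known points."""
    (p0, g0), (p1, g1) = KNOWN_POINTS
    slope = (g1 - g0) / (p1 - p0)
    return g0 + (width * height - p0) * slope


def largest_square_side(vram_gb, step=8):
    """Largest side (a multiple of `step`) whose estimate fits within `vram_gb`.

    Returns 0 when even the smallest candidate size is estimated not to fit.
    """
    side = 0
    while estimate_vram_gb(side + step, side + step) <= vram_gb:
        side += step
    return side


if __name__ == "__main__":
    for gb in (8, 12, 16, 24):
        side = largest_square_side(gb)
        print(f"{gb} GB -> try about {side}x{side}")
```

Treat the result as a starting point and step down if an out-of-memory error still occurs, since higher cut counts raise memory use at any resolution.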

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
