LaVi-Bridge by ShihaoZhaoZSH

Code release for a text-to-image generation research paper

Created 1 year ago
296 stars

Top 89.5% on SourcePulse

Project Summary

LaVi-Bridge enables flexible text-to-image generation by bridging diverse pre-trained large language models (LLMs) with generative vision models. It targets researchers and practitioners in AI and computer vision who want to experiment with novel LLM-vision combinations for image synthesis without modifying the base models' weights. The primary benefit is a plug-and-play framework that uses LoRA and adapters for integration.

How It Works

LaVi-Bridge acts as an intermediary layer that connects various LLMs to diffusion-based vision models. It uses LoRA (Low-Rank Adaptation) and adapter modules to inject the LLM's text representations into the vision model's generation process. Because neither the LLM nor the vision model is fully fine-tuned, integration stays efficient and the original models' capabilities are preserved.
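The sketch below is a minimal, self-contained PyTorch illustration of this idea, not code from the repository: a small adapter projects LLM hidden states into the conditioning width a diffusion U-Net's cross-attention expects, and a LoRA wrapper adds a trainable low-rank update to a frozen projection layer. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights are never modified
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so init behavior is unchanged
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class Adapter(nn.Module):
    """Projects LLM hidden states into the width the U-Net's cross-attention expects."""

    def __init__(self, llm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(llm_hidden)


# Toy forward pass: only the adapter and the LoRA factors carry gradients.
llm_hidden = torch.randn(1, 77, 4096)           # stand-in for Llama-2-7B token features
adapter = Adapter(llm_dim=4096, cond_dim=768)   # 768 is a typical SD cross-attention width
cond = adapter(llm_hidden)                      # (1, 77, 768) conditioning sequence

frozen_proj = nn.Linear(768, 320)               # stand-in for a frozen U-Net key projection
lora_proj = LoRALinear(frozen_proj, rank=8)
keys = lora_proj(cond)                          # (1, 77, 320)
print(keys.shape)
```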

Quick Start & Requirements

  • Install via conda env create -f environment.yaml and conda activate lavi-bridge.
  • Requires pre-trained LoRA/adapters (download link provided).
  • For Llama-2 integration, download the Llama-2-7b weights and point --llama2_dir in run.sh at them (a download sketch follows this list).
  • Official project page and paper (arXiv) links are available.
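For the Llama-2 step, the snippet below is a hedged sketch using the huggingface_hub library; the repository id and local path are assumptions, and the gated Llama-2 weights require accepting Meta's license on Hugging Face and an authenticated token.

```python
# Hypothetical helper for fetching the Llama-2-7b weights referenced by --llama2_dir.
# The repo id and target directory are assumptions; log in first, e.g. `huggingface-cli login`.
from huggingface_hub import snapshot_download

llama2_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",   # assumed Hugging Face repo id for the 7B weights
    local_dir="./weights/llama2-7b",      # any writable local directory
)
print(f"Set --llama2_dir in run.sh to: {llama2_dir}")
```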

Highlighted Details

  • Supports combinations like T5-Large + U-Net(SD) and Llama-2 + U-Net(SD).
  • Offers training scripts for custom datasets, recommending COCO2017 and JourneyDB.
  • Allows experimentation with different LLMs (T5 variants, Llama-2) and vision backbones (U-Net, Transformer); the check below illustrates why a bridging adapter is needed.
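The short check below is an illustration, not repository code: it reads the hidden widths of two candidate text encoders via Hugging Face transformers to show that a bridge must project differently sized LLM features into whatever the vision backbone's cross-attention expects. Accessing the Llama-2 config is gated and needs an authenticated token.

```python
# Illustrative only: different LLMs expose different hidden widths.
from transformers import AutoConfig

for name in ["t5-large", "meta-llama/Llama-2-7b-hf"]:  # Llama-2 config access is gated
    cfg = AutoConfig.from_pretrained(name)
    # T5 configs call the width d_model; Llama configs call it hidden_size.
    width = getattr(cfg, "d_model", None) or getattr(cfg, "hidden_size", None)
    print(f"{name}: hidden width = {width}")  # t5-large -> 1024, Llama-2-7b -> 4096
```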

Maintenance & Community

  • The accompanying paper was accepted at ECCV 2024.
  • Built upon another repository (link provided).

Licensing & Compatibility

  • License type is not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which may hinder commercial adoption. While the framework is flexible, setup requires downloading separate pre-trained weights for the LLMs and the LoRA/adapter modules, which adds to the initial download footprint.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
