Omost by lllyasviel

Image composer using LLMs to generate code for image creation

Created 1 year ago
7,661 stars

Top 6.8% on SourcePulse

View on GitHub
Project Summary

Omost is a project that leverages Large Language Models (LLMs) to translate coding capabilities into image composition. It targets users who want to generate images through detailed, structured descriptions, enabling precise control over visual elements and their arrangement. The core benefit is a more programmatic and controllable approach to image generation compared to traditional text-to-image models.

How It Works

Omost utilizes LLMs to generate Python code that interacts with a virtual "Canvas agent." This agent allows for granular control over image composition by defining global scene descriptions and adding local elements with specific locations, offsets, areas, and relative depths. The LLM-generated code is designed to be easily interpretable by diffusion models, facilitating precise image rendering. The project emphasizes a "sub-prompt" strategy for descriptions, breaking down complex prompts into smaller, self-contained units to improve LLM understanding and prevent semantic truncation during encoding.
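
Concretely, the LLM's output reads as a short Python program written against this Canvas agent. The sketch below is illustrative only: the method and parameter names follow the examples documented by the project, but the scene content is invented here and exact signatures may differ.

```python
# Illustrative sketch of LLM-emitted Canvas code. Method and parameter names
# follow the project's documented examples; the scene content is invented.
canvas = Canvas()  # the virtual "Canvas agent" the generated code targets

# Global scene description, split into short, self-contained sub-prompts so
# each piece can be encoded without semantic truncation.
canvas.set_global_description(
    description='A cozy reading room at dusk.',
    detailed_descriptions=[
        'warm lamplight fills the room',
        'bookshelves line the back wall',
    ],
    tags='interior, warm light, books, evening',
    HTML_web_color_name='saddlebrown',
)

# A local element with discrete location/offset/area choices and a relative
# depth, which the renderer turns into a region for the diffusion model.
canvas.add_local_description(
    location='on the left',
    offset='slightly to the lower',
    area='a medium-sized vertical area',
    distance_to_viewer=3.0,
    description='an armchair with a knitted blanket',
    detailed_descriptions=[
        'a worn leather armchair',
        'a red knitted blanket draped over the armrest',
    ],
    tags='armchair, blanket, leather',
    atmosphere='calm and inviting',
    style='soft painterly rendering',
    quality_meta='high detail',
    HTML_web_color_name='darkolivegreen',
)
```

Each string in detailed_descriptions is one "sub-prompt": short enough to be encoded whole and meaningful on its own.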

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n omost python=3.10), activate it (conda activate omost), install PyTorch with CUDA 12.1 support (pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121), and install requirements (pip install -r requirements.txt).
  • Running: Execute python gradio_app.py.
  • Hardware: Requires 8GB Nvidia VRAM. Quantized LLMs may require bitsandbytes, which can cause issues on older GPUs (9XX, 10XX, 20XX series).
  • Resources: Official HuggingFace space is available as an alternative.

Highlighted Details

  • Provides 3 pretrained LLM models based on Llama3 and Phi3 variations, including quantized versions.
  • Employs a novel bounding box representation using 9 locations, 9 offsets, and 9 area types (729 combinations) for precise object placement, which LLMs handle more robustly than pixel coordinates (see the sketch after this list).
  • Introduces a "Prompt Prefix Tree" concept to improve prompt understanding by structuring sub-prompts hierarchically.
  • Includes a baseline renderer based on attention score manipulation for parameter-free, style-agnostic region guidance.
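
As a rough illustration of the 9 × 9 × 9 placement scheme noted above, the sketch below enumerates placeholder location, offset, and area labels and counts the combinations. The label strings are stand-ins, not necessarily the project's exact vocabulary.

```python
# Rough sketch of the discrete placement vocabulary: 9 locations x 9 offsets
# x 9 area types = 729 combinations. Labels are illustrative stand-ins.
from itertools import product

locations = [
    'in the center', 'on the left', 'on the right', 'on the top',
    'on the bottom', 'on the top-left', 'on the top-right',
    'on the bottom-left', 'on the bottom-right',
]
offsets = [
    'no offset', 'slightly to the left', 'slightly to the right',
    'slightly to the upper', 'slightly to the lower',
    'slightly to the upper-left', 'slightly to the upper-right',
    'slightly to the lower-left', 'slightly to the lower-right',
]
areas = [
    f'a {size} {shape} area'
    for size in ('small', 'medium-sized', 'large')
    for shape in ('square', 'vertical', 'horizontal')
]

combinations = list(product(locations, offsets, areas))
assert len(combinations) == 9 * 9 * 9 == 729
print(combinations[0])  # ('in the center', 'no offset', 'a small square area')
```

Because every choice is one of a small set of named options rather than a free-form pixel coordinate, the LLM only has to pick labels, which it does more reliably.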

Maintenance & Community

  • Models are trained on H100 clusters using fp16 precision.
  • The project is associated with the "Omost Team."
  • Links to related work in multi-modal LLM research are provided.

Licensing & Compatibility

  • Models are subject to the licenses of Llama-3 and Phi-3.
  • The project itself does not specify a license in the README.

Limitations & Caveats

  • The 128k context length of omost-phi-3-mini-128k is unreliable beyond ~8k tokens.
  • Quantizing omost-phi-3-mini-128k to 4 bits is not recommended due to performance degradation.
  • The omost-dolphin-2.9-llama3-8b model is trained without NSFW filtering and requires user-applied safety alignment for public services.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

Any-to-any multimodal LLM research paper

Created 2 years ago
Updated 4 months ago
4k stars

Top 0.1% on SourcePulse