Omost by lllyasviel

Image composer using LLMs to generate code for image creation

created 1 year ago
7,658 stars

Top 6.9% on sourcepulse

View on GitHub
Project Summary

Omost is a project that leverages Large Language Models (LLMs) to translate coding capabilities into image composition. It targets users who want to generate images through detailed, structured descriptions, enabling precise control over visual elements and their arrangement. The core benefit is a more programmatic and controllable approach to image generation compared to traditional text-to-image models.

How It Works

Omost utilizes LLMs to generate Python code that interacts with a virtual "Canvas agent." This agent allows for granular control over image composition by defining global scene descriptions and adding local elements with specific locations, offsets, areas, and relative depths. The LLM-generated code is designed to be easily interpretable by diffusion models, facilitating precise image rendering. The project emphasizes a "sub-prompt" strategy for descriptions, breaking down complex prompts into smaller, self-contained units to improve LLM understanding and prevent semantic truncation during encoding.
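
The sketch below illustrates the call pattern the LLM emits for this Canvas agent. It paraphrases the examples in the Omost repository; the import path and the parameter names (set_global_description, add_local_description, location, offset, area, distance_to_viewer, HTML_web_color_name) are taken from those examples but should be treated as approximate rather than a pinned API reference.

```python
# Illustrative sketch of the Python an Omost LLM generates.
# Names are paraphrased from the repository's examples; the import path
# and exact keyword arguments are assumptions, not a pinned API.
from lib_omost.canvas import Canvas  # module path is an assumption

canvas = Canvas()

# Global scene description, written as short self-contained sub-prompts
# so each one survives encoding without semantic truncation.
canvas.set_global_description(
    description='A cozy reading corner at sunset.',
    detailed_descriptions=[
        'A cozy reading corner bathed in warm evening light.',
        'Soft orange light falls across a wooden floor.',
    ],
    tags='reading corner, sunset, warm light, interior',
    HTML_web_color_name='darkorange',
)

# One local element, placed with categorical location/offset/area terms
# plus a relative depth used to order overlapping elements.
canvas.add_local_description(
    location='on the left',
    offset='slightly to the lower',
    area='a medium-sized vertical area',
    distance_to_viewer=3.0,
    description='An overstuffed armchair with a knitted blanket.',
    detailed_descriptions=[
        'An overstuffed armchair upholstered in deep green fabric.',
        'A knitted blanket draped over the armrest.',
    ],
    tags='armchair, blanket, green fabric',
    HTML_web_color_name='darkgreen',
)
```

In the full pipeline, the finished canvas is then handed to a renderer that turns these regions into conditions for a diffusion model.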

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n omost python=3.10), activate it (conda activate omost), install PyTorch with CUDA 12.1 support (pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121), and install requirements (pip install -r requirements.txt).
  • Running: Execute python gradio_app.py.
  • Hardware: Requires 8GB Nvidia VRAM. Quantized LLMs may require bitsandbytes, which can cause issues on older GPUs (9XX, 10XX, 20XX series).
  • Resources: An official Hugging Face Space is available as an alternative to running locally.

Highlighted Details

  • Provides three pretrained LLMs based on Llama-3 and Phi-3 variants, including quantized versions.
  • Employs a novel bounding box representation using 9 locations, 9 offsets, and 9 area types (729 combinations) for precise object placement, which LLMs handle more robustly than raw pixel coordinates; a sketch of this vocabulary appears after this list.
  • Introduces a "Prompt Prefix Tree" concept to improve prompt understanding by structuring sub-prompts hierarchically.
  • Includes a baseline renderer based on attention score manipulation for parameter-free, style-agnostic region guidance.
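
As a concrete illustration of the bounding-box vocabulary mentioned above, the snippet below enumerates candidate term lists for the 9 locations, 9 offsets, and 9 area types. The specific wordings are paraphrased from the project's documentation and should be treated as an assumption; Omost defines the authoritative lists and their mapping to actual boxes in its canvas code.

```python
# Hypothetical enumeration of the categorical placement vocabulary.
# The exact strings and their numeric mapping live in Omost's canvas code;
# the lists below are an approximation for illustration only.

LOCATIONS = [  # which cell of a coarse 3x3 grid the element occupies
    'in the center', 'on the left', 'on the right',
    'on the top', 'on the bottom',
    'on the top-left', 'on the top-right',
    'on the bottom-left', 'on the bottom-right',
]

OFFSETS = [  # small nudge applied within or around the chosen cell
    'no offset', 'slightly to the left', 'slightly to the right',
    'slightly to the upper', 'slightly to the lower',
    'slightly to the upper-left', 'slightly to the upper-right',
    'slightly to the lower-left', 'slightly to the lower-right',
]

AREAS = [  # coarse size and aspect ratio of the occupied region
    'a small square area', 'a small vertical area', 'a small horizontal area',
    'a medium-sized square area', 'a medium-sized vertical area',
    'a medium-sized horizontal area',
    'a large square area', 'a large vertical area', 'a large horizontal area',
]

# 9 x 9 x 9 = 729 distinct placements, each expressed in plain words that
# an LLM can emit far more reliably than raw pixel coordinates.
assert len(LOCATIONS) * len(OFFSETS) * len(AREAS) == 729
```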

Maintenance & Community

  • Models are trained on H100 clusters using fp16 precision.
  • The project is associated with the "Omost Team."
  • Links to related work in multi-modal LLM research are provided.

Licensing & Compatibility

  • Models are subject to the licenses of Llama-3 and Phi-3.
  • The project itself does not specify a license in the README.

Limitations & Caveats

  • The 128k context length of omost-phi-3-mini-128k is unreliable beyond ~8k tokens.
  • Quantizing omost-phi-3-mini-128k to 4 bits is not recommended due to performance degradation.
  • The omost-dolphin-2.9-llama3-8b model is trained without NSFW filtering and requires user-applied safety alignment for public services.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 101 stars in the last 90 days
