Omost by lllyasviel

Image composer using LLMs to generate code for image creation

Created 1 year ago
7,661 stars

Top 6.8% on SourcePulse

View on GitHub
Project Summary

Omost is a project that leverages Large Language Models (LLMs) to translate coding capabilities into image composition. It targets users who want to generate images through detailed, structured descriptions, enabling precise control over visual elements and their arrangement. The core benefit is a more programmatic and controllable approach to image generation compared to traditional text-to-image models.

How It Works

Omost utilizes LLMs to generate Python code that interacts with a virtual "Canvas agent." This agent allows for granular control over image composition by defining global scene descriptions and adding local elements with specific locations, offsets, areas, and relative depths. The LLM-generated code is designed to be easily interpretable by diffusion models, facilitating precise image rendering. The project emphasizes a "sub-prompt" strategy for descriptions, breaking down complex prompts into smaller, self-contained units to improve LLM understanding and prevent semantic truncation during encoding.
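
Concretely, the LLM's output reads as a short Python program written against this Canvas agent. The sketch below is illustrative only: the method and parameter names follow the examples documented by the project, but the scene content is invented here and exact signatures may differ.

```python
# Illustrative sketch of LLM-emitted Canvas code. Method and parameter names
# follow the project's documented examples; the scene content is invented.
canvas = Canvas()  # the virtual "Canvas agent" the generated code targets

# Global scene description, split into short, self-contained sub-prompts so
# each piece can be encoded without semantic truncation.
canvas.set_global_description(
    description='A cozy reading room at dusk.',
    detailed_descriptions=[
        'warm lamplight fills the room',
        'bookshelves line the back wall',
    ],
    tags='interior, warm light, books, evening',
    HTML_web_color_name='saddlebrown',
)

# A local element with discrete location/offset/area choices and a relative
# depth, which the renderer turns into a region for the diffusion model.
canvas.add_local_description(
    location='on the left',
    offset='slightly to the lower',
    area='a medium-sized vertical area',
    distance_to_viewer=3.0,
    description='an armchair with a knitted blanket',
    detailed_descriptions=[
        'a worn leather armchair',
        'a red knitted blanket draped over the armrest',
    ],
    tags='armchair, blanket, leather',
    atmosphere='calm and inviting',
    style='soft painterly rendering',
    quality_meta='high detail',
    HTML_web_color_name='darkolivegreen',
)
```

Each string in detailed_descriptions is one "sub-prompt": short enough to be encoded whole and meaningful on its own.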

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n omost python=3.10), activate it (conda activate omost), install PyTorch with CUDA 12.1 support (pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121), and install requirements (pip install -r requirements.txt).
  • Running: Execute python gradio_app.py.
  • Hardware: Requires 8GB Nvidia VRAM. Quantized LLMs may require bitsandbytes, which can cause issues on older GPUs (9XX, 10XX, 20XX series).
  • Resources: Official HuggingFace space is available as an alternative.

Highlighted Details

  • Provides 3 pretrained LLM models based on Llama3 and Phi3 variations, including quantized versions.
  • Employs a novel bounding box representation using 9 locations, 9 offsets, and 9 area types (729 combinations) for precise object placement, which LLMs handle more robustly than pixel coordinates (see the sketch after this list).
  • Introduces a "Prompt Prefix Tree" concept to improve prompt understanding by structuring sub-prompts hierarchically.
  • Includes a baseline renderer based on attention score manipulation for parameter-free, style-agnostic region guidance.
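
As a rough illustration of the 9 × 9 × 9 placement scheme noted above, the sketch below enumerates placeholder location, offset, and area labels and counts the combinations. The label strings are stand-ins, not necessarily the project's exact vocabulary.

```python
# Rough sketch of the discrete placement vocabulary: 9 locations x 9 offsets
# x 9 area types = 729 combinations. Labels are illustrative stand-ins.
from itertools import product

locations = [
    'in the center', 'on the left', 'on the right', 'on the top',
    'on the bottom', 'on the top-left', 'on the top-right',
    'on the bottom-left', 'on the bottom-right',
]
offsets = [
    'no offset', 'slightly to the left', 'slightly to the right',
    'slightly to the upper', 'slightly to the lower',
    'slightly to the upper-left', 'slightly to the upper-right',
    'slightly to the lower-left', 'slightly to the lower-right',
]
areas = [
    f'a {size} {shape} area'
    for size in ('small', 'medium-sized', 'large')
    for shape in ('square', 'vertical', 'horizontal')
]

combinations = list(product(locations, offsets, areas))
assert len(combinations) == 9 * 9 * 9 == 729
print(combinations[0])  # ('in the center', 'no offset', 'a small square area')
```

Because every choice is one of a small set of named options rather than a free-form pixel coordinate, the LLM only has to pick labels, which it does more reliably.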

Maintenance & Community

  • Models are trained on H100 clusters using fp16 precision.
  • The project is associated with the "Omost Team."
  • Links to related work in multi-modal LLM research are provided.

Licensing & Compatibility

  • Models are subject to the licenses of Llama-3 and Phi-3.
  • The project itself does not specify a license in the README.

Limitations & Caveats

  • The 128k context length of omost-phi-3-mini-128k is unreliable beyond ~8k tokens.
  • Quantizing omost-phi-3-mini-128k to 4 bits is not recommended due to performance degradation.
  • The omost-dolphin-2.9-llama3-8b model is trained without NSFW filtering and requires user-applied safety alignment for public services.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

Any-to-any multimodal LLM research paper

Created 2 years ago
Updated 4 months ago
4k stars

Top 0.1% on SourcePulse