Text-to-image research paper using LLMs for interactive prompting
Mini-DALLE3 offers an experimental, interactive text-to-image and text-to-text experience that aims to replicate the interleaved image-and-text capabilities of DALL-E 3 within ChatGPT. It targets users who want a conversational interface for generating and manipulating images, integrating LLM-driven text and visual content creation in a single dialogue.
How It Works
The project uses large language models (LLMs) to interpret user prompts and decide when to produce images, which are then generated with Stable Diffusion XL (SDXL) and IP-Adapter. This enables an interleaved conversational flow in which text and image generation occur dynamically within the same interaction, mimicking a more natural and intuitive user experience; a minimal sketch of such a loop is shown below.
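To make the control flow concrete, here is a minimal, hypothetical sketch of an interleaved loop: the LLM is instructed to wrap image prompts in tags, and tagged spans are routed to an SDXL pipeline. The tag convention, system prompt, and model names are assumptions for illustration, not the project's actual scheme, and the IP-Adapter conditioning used for image editing is omitted.

```python
# Illustrative sketch only -- not Mini-DALLE3's actual API or prompting scheme.
import re
import torch
from openai import OpenAI
from diffusers import StableDiffusionXLPipeline

client = OpenAI()  # reads OPENAI_API_KEY (and OPENAI_API_BASE, if set)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Assumed convention: the LLM embeds SDXL prompts inside <image>...</image> tags.
SYSTEM = ("You are a drawing assistant. When the user asks for an image, "
          "embed a concise SDXL prompt inside <image>...</image> tags.")

def chat_turn(history: list[dict], user_msg: str):
    """One interleaved turn: get the LLM reply, render any embedded image prompts."""
    history.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "system", "content": SYSTEM}] + history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    # Route each tagged prompt to the SDXL pipeline.
    prompts = re.findall(r"<image>(.*?)</image>", reply, re.S)
    images = [pipe(prompt=p).images[0] for p in prompts]
    return reply, images
```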
Quick Start & Requirements
Download the required model checkpoints into `checkpoints/models/sdxl_models`, then set your OpenAI API key and launch the web UI:

`export OPENAI_API_KEY="your key" && python -m minidalle3.web`

A custom or self-hosted LLM can be used by setting `OPENAI_API_BASE` and running the corresponding LLM modules, as sketched below.
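As a concrete illustration of what the `OPENAI_API_BASE` override does, the standard `openai` Python client accepts an equivalent `base_url` argument. The endpoint URL and model name below are placeholders, not values from the project:

```python
from openai import OpenAI

# Point the client at a self-hosted, OpenAI-compatible endpoint instead of
# api.openai.com. URL and model name are illustrative placeholders.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # same effect as OPENAI_API_BASE
    api_key="not-needed-for-local",
)
resp = client.chat.completions.create(
    model="local-llm",
    messages=[{"role": "user", "content": "Draw a cat wearing a hat."}],
)
print(resp.choices[0].message.content)
```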
Maintenance & Community
The project is authored by Zeqiang Lai, Xizhou Zhu, Jifeng Dai, and Yu Qiao. Further community engagement details are not provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Llama LLM support is not yet implemented, and Qwen has not been tested. Several planned features, including multi-image generation, image selection, and prompt refinement, remain on the TODO list, reflecting the project's experimental and incomplete state.