Unified framework for visual tasks
UniWorld-V1 is a unified framework for visual understanding and generation, targeting researchers and developers in computer vision and multimodal AI. It offers a single model that handles a wide array of tasks, including text-to-image generation, image editing, and visual perception, simplifying complex visual workflows.
How It Works
UniWorld-V1 uses contrastive semantic encoders as reference control signals, departing from the VAE-encoded references common in prior work. High-resolution global features from such encoders preserve fine detail better than VAE latents; combined with the Qwen2.5-VL vision-language model and the FLUX.1-dev diffusion backbone, they enable precise control over image generation and editing without requiring learnable tokens.
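This summary does not pin down the exact encoder or wiring, so the sketch below uses SigLIP as a stand-in contrastive semantic encoder; the projection layer and its target width are assumptions, illustrating how semantic reference features (rather than VAE latents) might condition a diffusion backbone.

```python
# Hypothetical sketch: conditioning a diffusion backbone on contrastive
# semantic features instead of VAE-encoded latents. The SigLIP checkpoint
# is a real Hugging Face model; the projection/fusion below is illustrative,
# not UniWorld-V1's exact architecture.
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

image = Image.open("reference.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Patch-level semantic features: (1, num_patches, hidden_size).
    ref_features = encoder(pixel_values).last_hidden_state

# Project to the diffusion model's conditioning width (3072 here is an
# assumption, not FLUX.1-dev's confirmed inner width) so the reference
# tokens can be fed alongside text tokens as control signals.
project = torch.nn.Linear(ref_features.shape[-1], 3072)
ref_tokens = project(ref_features)  # (1, num_patches, 3072)
```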
Quick Start & Requirements
Create a conda environment with Python 3.10 and install dependencies via pip install -r requirements.txt and pip install flash_attn --no-build-isolation. Then launch the interactive CLI (python -m univa.serve.cli) or the Gradio demo (python app.py), as sketched below.
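The commands above, assembled into a setup sequence; the environment name "univa" is illustrative and not specified in the source.

```bash
# Create and activate the environment (name is illustrative).
conda create -n univa python=3.10 -y
conda activate univa

# Install dependencies as documented.
pip install -r requirements.txt
pip install flash_attn --no-build-isolation

# Launch the interactive CLI:
python -m univa.serve.cli
# ...or the Gradio demo:
python app.py
```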
Highlighted Details
Maintenance & Community
The project is actively maintained by PKU-YuanGroup, with community contributions encouraged. Links to Discord and WeChat are provided for community engagement. Related projects like ImgEdit, Open-Sora Plan, and WISE are also highlighted.
Licensing & Compatibility
The primary license is MIT. However, the FLUX weights are under the FLUX.1 [dev] Non-Commercial License, which may restrict commercial use.
Limitations & Caveats
The FLUX model weights are restricted to non-commercial use. Training requires substantial computational resources. Some datasets, like Ghibli-36k, are noted as not having undergone quality filtering.