UniWorld-V1 by PKU-YuanGroup

Unified framework for visual tasks

created 8 months ago
668 stars

Top 51.4% on sourcepulse

Project Summary

UniWorld-V1 is a unified framework for visual understanding and generation, targeting researchers and developers in computer vision and multimodal AI. It offers a single model that handles a wide array of tasks, including text-to-image generation, image editing, and visual perception, simplifying complex visual workflows.

How It Works

UniWorld-V1 uses contrastive semantic encoders as reference control signals, departing from the conventional VAE-encoded references. High-resolution global features from a contrastive encoder (SigLIP2) preserve fine details better than VAE latents, and the architecture pairs a Vision-Language Model (Qwen2.5-VL) for instruction and image understanding with a diffusion backbone (FLUX.1-dev) for synthesis. This enables precise control over image generation and editing without requiring learnable tokens.
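
To make the control path concrete, here is a minimal sketch of how contrastive encoder features could serve as a reference signal in place of VAE latents. It is an illustration only: the SigLIP checkpoint stands in for the SigLIP2 weights the project uses, and the projection layer and its 4096-dim output width are hypothetical, not UniWorld-V1's actual adapter.

    import torch
    from PIL import Image
    from transformers import SiglipImageProcessor, SiglipVisionModel

    # Stand-in checkpoint: a SigLIP encoder used here in place of the SigLIP2
    # weights UniWorld-V1 ships with (assumption, not the project's loader).
    CKPT = "google/siglip-so400m-patch14-384"
    encoder = SiglipVisionModel.from_pretrained(CKPT)
    processor = SiglipImageProcessor.from_pretrained(CKPT)

    ref = Image.open("reference.png").convert("RGB")
    inputs = processor(images=ref, return_tensors="pt")

    with torch.no_grad():
        out = encoder(**inputs)

    # Patch-level contrastive features act as the reference control signal,
    # replacing the VAE-encoded references used by earlier editing pipelines.
    patch_feats = out.last_hidden_state              # (1, num_patches, hidden)

    # Hypothetical projection into the diffusion transformer's conditioning
    # width; UniWorld-V1's real adapter may differ.
    to_dit = torch.nn.Linear(patch_feats.shape[-1], 4096)
    control = to_dit(patch_feats)                    # fed to the DiT with text tokens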

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment with Python 3.10, and install dependencies via pip install -r requirements.txt and pip install flash_attn --no-build-isolation.
  • Prerequisites: Requires PyTorch, Hugging Face libraries, and specific model checkpoints (UniWorld-V1, FLUX.1-dev, SigLIP2). A GPU with at least 24GB VRAM is recommended for full functionality, with NF4 quantization and offloading options available for lower VRAM (see the loading sketch after this list).
  • Running: Execute via CLI (python -m univa.serve.cli) or Gradio demo (python app.py).
  • Resources: Download links for models and datasets are provided. Training requires significant VRAM (74GB+ for 512x512, 78GB+ for Stage 2).
  • Links: Hugging Face Model, Hugging Face Dataset, Demos.
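
As a rough illustration of the low-VRAM options above, the following sketch loads the FLUX.1-dev backbone with NF4 quantization and CPU offloading via the Hugging Face diffusers API. It is a standalone example assuming a recent diffusers with bitsandbytes support, not UniWorld-V1's own serving code (which lives behind python -m univa.serve.cli).

    import torch
    from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

    MODEL_ID = "black-forest-labs/FLUX.1-dev"

    # NF4 quantization shrinks the transformer to fit 24GB-class GPUs.
    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    transformer = FluxTransformer2DModel.from_pretrained(
        MODEL_ID, subfolder="transformer",
        quantization_config=quant, torch_dtype=torch.bfloat16,
    )
    pipe = FluxPipeline.from_pretrained(
        MODEL_ID, transformer=transformer, torch_dtype=torch.bfloat16,
    )
    # Offloading parks idle submodules on the CPU, trading speed for VRAM.
    pipe.enable_model_cpu_offload()

    image = pipe(
        "a watercolor fox in a snowy forest",
        num_inference_steps=28, guidance_scale=3.5,
    ).images[0]
    image.save("fox.png")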

Highlighted Details

  • Fully open-sourced models, data, and training/evaluation code.
  • Curated 10+ CV downstream tasks and 286K long-caption samples.
  • Utilizes contrastive visual encoders for reference control, improving fidelity.
  • Integrates image priors via VLM encoding without learnable tokens.
  • Achieves state-of-the-art results on various image generation and editing benchmarks.

Maintenance & Community

The project is actively maintained by PKU-YuanGroup, with community contributions encouraged. Links to Discord and WeChat are provided for community engagement. Related projects like ImgEdit, Open-Sora Plan, and WISE are also highlighted.

Licensing & Compatibility

The primary license is MIT. However, the FLUX weights are under the FLUX.1 [dev] Non-Commercial License, which may restrict commercial use.

Limitations & Caveats

The FLUX model weights are restricted to non-commercial use. Training requires substantial computational resources. Some datasets, like Ghibli-36k, are noted as not having undergone quality filtering.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 11
  • Star History: 674 stars in the last 90 days
