UniWorld-V1 by PKU-YuanGroup

Unified framework for visual tasks

created 8 months ago
668 stars

Top 51.4% on sourcepulse

Project Summary

UniWorld-V1 is a unified framework for visual understanding and generation, targeting researchers and developers in computer vision and multimodal AI. It offers a single model that handles a wide array of tasks, including text-to-image generation, image editing, and visual perception, simplifying complex visual workflows.

How It Works

UniWorld-V1 uses contrastive semantic encoders as reference control signals, departing from the conventional VAE-encoded references. High-resolution global features from a contrastive encoder (SigLIP2) preserve fine details better than VAE latents, and the architecture pairs a Vision-Language Model (Qwen2.5-VL) for instruction and image understanding with a diffusion backbone (FLUX.1-dev) for synthesis. This enables precise control over image generation and editing without requiring learnable tokens.
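
To make the control path concrete, here is a minimal sketch of how contrastive encoder features could serve as a reference signal in place of VAE latents. It is an illustration only: the SigLIP checkpoint stands in for the SigLIP2 weights the project uses, and the projection layer and its 4096-dim output width are hypothetical, not UniWorld-V1's actual adapter.

    import torch
    from PIL import Image
    from transformers import SiglipImageProcessor, SiglipVisionModel

    # Stand-in checkpoint: a SigLIP encoder used here in place of the SigLIP2
    # weights UniWorld-V1 ships with (assumption, not the project's loader).
    CKPT = "google/siglip-so400m-patch14-384"
    encoder = SiglipVisionModel.from_pretrained(CKPT)
    processor = SiglipImageProcessor.from_pretrained(CKPT)

    ref = Image.open("reference.png").convert("RGB")
    inputs = processor(images=ref, return_tensors="pt")

    with torch.no_grad():
        out = encoder(**inputs)

    # Patch-level contrastive features act as the reference control signal,
    # replacing the VAE-encoded references used by earlier editing pipelines.
    patch_feats = out.last_hidden_state              # (1, num_patches, hidden)

    # Hypothetical projection into the diffusion transformer's conditioning
    # width; UniWorld-V1's real adapter may differ.
    to_dit = torch.nn.Linear(patch_feats.shape[-1], 4096)
    control = to_dit(patch_feats)                    # fed to the DiT with text tokens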

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment with Python 3.10, and install dependencies via pip install -r requirements.txt and pip install flash_attn --no-build-isolation.
  • Prerequisites: Requires PyTorch, Hugging Face libraries, and specific model checkpoints (UniWorld-V1, FLUX.1-dev, SigLIP2). A GPU with at least 24GB VRAM is recommended for full functionality, with NF4 quantization and offloading options available for lower VRAM (see the loading sketch after this list).
  • Running: Execute via CLI (python -m univa.serve.cli) or Gradio demo (python app.py).
  • Resources: Download links for models and datasets are provided. Training requires significant VRAM (74GB+ for 512x512, 78GB+ for Stage 2).
  • Links: Hugging Face Model, Hugging Face Dataset, Demos.
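
As a rough illustration of the low-VRAM options above, the following sketch loads the FLUX.1-dev backbone with NF4 quantization and CPU offloading via the Hugging Face diffusers API. It is a standalone example assuming a recent diffusers with bitsandbytes support, not UniWorld-V1's own serving code (which lives behind python -m univa.serve.cli).

    import torch
    from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

    MODEL_ID = "black-forest-labs/FLUX.1-dev"

    # NF4 quantization shrinks the transformer to fit 24GB-class GPUs.
    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    transformer = FluxTransformer2DModel.from_pretrained(
        MODEL_ID, subfolder="transformer",
        quantization_config=quant, torch_dtype=torch.bfloat16,
    )
    pipe = FluxPipeline.from_pretrained(
        MODEL_ID, transformer=transformer, torch_dtype=torch.bfloat16,
    )
    # Offloading parks idle submodules on the CPU, trading speed for VRAM.
    pipe.enable_model_cpu_offload()

    image = pipe(
        "a watercolor fox in a snowy forest",
        num_inference_steps=28, guidance_scale=3.5,
    ).images[0]
    image.save("fox.png")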

Highlighted Details

  • Fully open-sourced models, data, and training/evaluation code.
  • Curated 10+ CV downstream tasks and 286K long-caption samples.
  • Utilizes contrastive visual encoders for reference control, improving fidelity.
  • Integrates image priors via VLM encoding without learnable tokens.
  • Achieves state-of-the-art results on various image generation and editing benchmarks.

Maintenance & Community

The project is actively maintained by PKU-YuanGroup, with community contributions encouraged. Links to Discord and WeChat are provided for community engagement. Related projects like ImgEdit, Open-Sora Plan, and WISE are also highlighted.

Licensing & Compatibility

The primary license is MIT. However, the FLUX weights are under the FLUX.1 [dev] Non-Commercial License, which may restrict commercial use.

Limitations & Caveats

The FLUX model weights are restricted to non-commercial use. Training requires substantial computational resources. Some datasets, like Ghibli-36k, are noted as not having undergone quality filtering.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 11
  • Star History: 674 stars in the last 90 days
