UniWorld-V1 by PKU-YuanGroup

Unified framework for visual tasks

Created 10 months ago
704 stars

Top 48.5% on SourcePulse

Project Summary

UniWorld-V1 is a unified framework for visual understanding and generation, targeting researchers and developers in computer vision and multimodal AI. A single model handles a wide array of tasks, including text-to-image generation, image editing, and visual perception, simplifying complex visual workflows.

How It Works

UniWorld-V1 uses contrastive semantic encoders as reference control signals, departing from the traditional VAE-encoded references. The high-resolution global features these encoders produce preserve fine detail, and the framework pairs them with a Vision-Language Model (Qwen2.5-VL) and a diffusion backbone (FLUX.1-dev), enabling precise control over image generation and editing without requiring learnable tokens.
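
As a rough sketch of the encoding step (a minimal illustration using the Hugging Face Transformers API; the SigLIP2 checkpoint id is an assumption, and how UniWorld-V1 actually wires these features into FLUX.1-dev is defined in the repository, not here):

    # Extract high-resolution semantic features from a reference image.
    # The checkpoint id below is an assumption, not UniWorld's pinned choice.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    ckpt = "google/siglip2-so400m-patch16-512"  # assumed SigLIP2 checkpoint
    processor = AutoProcessor.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)

    ref = Image.open("reference.png").convert("RGB")
    inputs = processor(images=ref, return_tensors="pt")
    with torch.no_grad():
        # One semantic token per image patch, rather than a VAE latent grid.
        feats = model.vision_model(**inputs).last_hidden_state
    print(feats.shape)  # (1, num_patches, hidden_dim)

Features like these would then condition the diffusion backbone as the reference control signal, in place of a VAE-encoded image latent.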

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment with Python 3.10, then install dependencies via pip install -r requirements.txt and pip install flash_attn --no-build-isolation.
  • Prerequisites: Requires PyTorch, Hugging Face libraries, and specific model checkpoints (UniWorld-V1, FLUX.1-dev, SigLIP2). A GPU with at least 24 GB of VRAM is recommended for full functionality; NF4 quantization and offloading options are available for lower-VRAM setups (see the sketch after this list).
  • Running: Run the model via the CLI (python -m univa.serve.cli) or the Gradio demo (python app.py).
  • Resources: Download links for models and datasets are provided. Training requires significant VRAM (74GB+ for 512x512, 78GB+ for Stage 2).
  • Links: Hugging Face Model, Hugging Face Dataset, Demos.
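
For the lower-VRAM route mentioned in the prerequisites, the sketch below follows the generic diffusers + bitsandbytes NF4 recipe for FLUX.1-dev; this is an assumed standard recipe, not necessarily the exact code path UniWorld-V1 ships:

    # NF4-quantize the FLUX.1-dev transformer and offload idle modules to CPU.
    # Generic diffusers/bitsandbytes recipe; UniWorld's own options may differ.
    import torch
    from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    transformer = FluxTransformer2DModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        quantization_config=quant,
        torch_dtype=torch.bfloat16,
    )
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()  # keep only the active submodule on GPU

    image = pipe("a watercolor fox in a snowy forest").images[0]
    image.save("fox.png")

NF4 roughly quarters the transformer's weight footprint relative to bfloat16, which is what makes inference feasible below the recommended 24 GB.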

Highlighted Details

  • Fully open-sourced models, data, and training/evaluation code.
  • Curated 10+ CV downstream tasks and 286K long-caption samples.
  • Utilizes contrastive visual encoders for reference control, improving fidelity.
  • Integrates image priors via VLM encoding without learnable tokens.
  • Achieves state-of-the-art results on various image generation and editing benchmarks.

Maintenance & Community

The project is actively maintained by PKU-YuanGroup, with community contributions encouraged. Links to Discord and WeChat are provided for community engagement. Related projects like ImgEdit, Open-Sora Plan, and WISE are also highlighted.

Licensing & Compatibility

The primary license is MIT. However, the FLUX weights are under the FLUX.1 [dev] Non-Commercial License, which may restrict commercial use.

Limitations & Caveats

The FLUX model weights are restricted to non-commercial use. Training requires substantial computational resources. Some datasets, like Ghibli-36k, are noted as not having undergone quality filtering.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 17 stars in the last 30 days
