Unified multimodal model combining reasoning with generative diffusion
Top 31.2% on sourcepulse
BLIP3-o is a unified multimodal model designed for both image understanding and generation tasks, targeting researchers and developers in AI. It offers state-of-the-art performance by combining autoregressive models for reasoning with diffusion models for generation, using a novel approach of diffusing semantically rich CLIP image features.
How It Works
BLIP3-o diffuses semantically rich CLIP image features, rather than VAE features or raw pixels. This approach enables a more powerful and efficient architecture, leading to stronger alignment and performance across various multimodal tasks. The model supports multiple diffusion methods, including CLIP + MSE and CLIP + Flow Matching, and can be integrated with different autoregressive backbones like Qwen-2.5-VL.
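To make the training objective concrete, the sketch below illustrates a CLIP + Flow Matching style loss: a small head, conditioned on hidden states from the autoregressive backbone, learns to predict the velocity that transports Gaussian noise toward the target CLIP image features. This is not the official BLIP3-o implementation; the feature and hidden-state dimensions and the MLP head are illustrative assumptions.

import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    """Toy stand-in for the diffusion head that predicts flow-matching velocity."""
    def __init__(self, feat_dim=1152, cond_dim=3584, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t: noisy CLIP features, cond: AR hidden state, t: timestep in [0, 1]
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(head, clip_feats, cond):
    noise = torch.randn_like(clip_feats)       # x_0 ~ N(0, I)
    t = torch.rand(clip_feats.size(0), 1)      # random time per sample
    x_t = (1 - t) * noise + t * clip_feats     # linear interpolation path
    target_velocity = clip_feats - noise       # d x_t / d t along that path
    pred = head(x_t, cond, t)
    return nn.functional.mse_loss(pred, target_velocity)

# Example with random tensors standing in for real CLIP features / AR states.
head = VelocityHead()
clip_feats = torch.randn(4, 1152)   # CLIP image features (assumed dimension)
cond = torch.randn(4, 3584)         # backbone hidden states (assumed dimension)
loss = flow_matching_loss(head, clip_feats, cond)
loss.backward()

A CLIP + MSE variant would instead regress the CLIP features directly with a mean-squared-error loss, dropping the noise and timestep machinery.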
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt after creating and activating a conda environment (conda create -n blip3o python=3.11 -y, then conda activate blip3o).
Requires: setuptools, pip.
Download the model checkpoint: python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model', repo_type='model'))"
Download the pretraining dataset: python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Pretrain', repo_type='dataset'))"
Highlighted Details
Maintenance & Community
Last updated: 6 days ago. Activity status: Inactive.
Licensing & Compatibility
Limitations & Caveats