BLIP3o by JiuhaiChen

Unified multimodal model combining autoregressive reasoning with diffusion-based image generation

created 3 months ago
1,311 stars

Top 31.2% on sourcepulse

View on GitHub
Project Summary

BLIP3-o is a unified multimodal model designed for both image understanding and generation tasks, targeting researchers and developers in AI. It offers state-of-the-art performance by combining autoregressive models for reasoning with diffusion models for generation, using a novel approach of diffusing semantically rich CLIP image features.

How It Works

BLIP3-o diffuses semantically rich CLIP image features, rather than VAE features or raw pixels. This approach enables a more powerful and efficient architecture, leading to stronger alignment and performance across various multimodal tasks. The model supports multiple diffusion methods, including CLIP + MSE and CLIP + Flow Matching, and can be integrated with different autoregressive backbones like Qwen-2.5-VL.
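
A minimal sketch of the CLIP + Flow Matching objective described above, assuming a rectified-flow (linear interpolation) formulation; clip_encoder, ar_backbone, and velocity_head are hypothetical placeholders rather than the actual BLIP3-o API:

    import torch

    def flow_matching_loss(clip_encoder, ar_backbone, velocity_head, images, prompts):
        # Target: semantically rich CLIP image features, not VAE latents or raw pixels.
        with torch.no_grad():
            x1 = clip_encoder(images)                  # (B, N, D) CLIP feature tokens
        x0 = torch.randn_like(x1)                      # Gaussian noise source
        t = torch.rand(x1.size(0), 1, 1, device=x1.device)
        xt = (1 - t) * x0 + t * x1                     # point on the linear noise-to-feature path
        target_velocity = x1 - x0                      # constant velocity along that path

        cond = ar_backbone(prompts)                    # autoregressive backbone supplies conditioning
        pred_velocity = velocity_head(xt, t, cond)     # diffusion head predicts the velocity
        return torch.mean((pred_velocity - target_velocity) ** 2)

In the CLIP + MSE variant, the head instead regresses the target CLIP features directly with a mean-squared-error loss rather than learning a velocity field.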

Quick Start & Requirements

  • Install: pip install -r requirements.txt after creating and activating a conda environment (conda create -n blip3o python=3.11 -y, conda activate blip3o).
  • Prerequisites: Python 3.11, setuptools, pip.
  • Model Checkpoint: Download via python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model', repo_type='model'))".
  • Dataset: Download pretraining datasets via python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Pretrain', repo_type='dataset'))" (see the consolidated download sketch after this list).
  • Demo: Available at https://huggingface.co/spaces/BLIP3o/BLIP3o-Demo.
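
A consolidated download sketch of the two snapshot_download calls above; the returned paths point into the local Hugging Face cache:

    from huggingface_hub import snapshot_download

    # Download the model checkpoint
    model_dir = snapshot_download(repo_id="BLIP3o/BLIP3o-Model", repo_type="model")
    print("Model checkpoint:", model_dir)

    # Download the pretraining dataset
    data_dir = snapshot_download(repo_id="BLIP3o/BLIP3o-Pretrain", repo_type="dataset")
    print("Pretraining dataset:", data_dir)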

Highlighted Details

  • Fully open-source: includes training data, recipes, model weights, and code.
  • Unified architecture for image understanding and generation.
  • State-of-the-art performance across benchmarks.
  • Supports tasks: Text → Text, Image → Text, Text → Image, Image → Image editing.
  • Integrates with EVA-CLIP + SDXL and SigLIP + SANA (coming soon).

Maintenance & Community

  • Active development with recent updates (May 2025) including code cleanup and dataset releases.
  • Discussion groups available via Discord: https://discord.gg/SsVYdV84bw.

Licensing & Compatibility

  • The README does not explicitly state the license.

Limitations & Caveats

  • The pretraining datasets are large, and training requires significant compute resources.
  • Some tokenizer issues are being fixed for LLaMA 3 integration.
Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 7

Star History

  • 1,320 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 2 more.

Explore Similar Projects

glide-text2im by openai

Top 0.1% on sourcepulse · 4k stars
Text-conditional image synthesis model from research paper
created 3 years ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Luca Antiga (CTO of Lightning AI).

mmagic by open-mmlab

Top 0.1% on sourcepulse · 7k stars
AIGC toolbox for image/video editing and generation
created 6 years ago · updated 1 year ago