reconstruction-alignment by HorizonWind2004

Self-supervised learning for enhanced unified multimodal models

Created 1 month ago
297 stars

Top 89.3% on SourcePulse

Project Summary

This repository implements "Reconstruction Alignment" (RecA), a self-supervised post-training technique that unlocks the zero-shot potential of Unified Multimodal Models (UMMs). RecA substantially improves both image generation benchmarks and image editing capability, targeting researchers and engineers who want to get more out of UMMs without additional labeled data.

How It Works

RecA post-trains a UMM with a self-supervised objective: the model's own visual-understanding encoder embeds an input image into dense semantic embeddings, those embeddings stand in for the text prompt, and the model is optimized to reconstruct the original image. Because the supervision comes from the image itself, no captions are required. Applied to architectures such as BAGEL, Harmon, Show-o, and OpenUni, this consistently yields substantial improvements, often letting smaller models surpass larger ones on zero-shot benchmarks. A toy sketch of the idea follows.
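
The recipe is simple enough to sketch in a few lines. The PyTorch loop below is a minimal, hypothetical illustration, not the authors' code: `DummyEncoder`, `DummyUMM`, and the plain MSE loss are stand-ins (the real objective depends on the host architecture's generation loss).

```python
import torch
import torch.nn.functional as F

class DummyEncoder(torch.nn.Module):
    """Stand-in for the UMM's visual-understanding encoder."""
    def forward(self, images):                 # (B, 3, H, W) -> (B, T, D)
        return torch.randn(images.shape[0], 16, 64)

class DummyUMM(torch.nn.Module):
    """Stand-in for the unified model's image-generation branch."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 3)     # dense prompt -> per-channel bias

    def forward(self, cond, noise):            # cond: (B, T, D)
        bias = self.proj(cond.mean(dim=1))     # (B, 3)
        return torch.tanh(noise + bias[:, :, None, None])

encoder, umm = DummyEncoder(), DummyUMM()
opt = torch.optim.AdamW(umm.parameters(), lr=1e-5)

def reca_step(images):
    """One self-supervised RecA update: the image's own understanding
    embeddings serve as a dense 'caption', and the model is trained to
    reconstruct the image from them -- no text labels needed."""
    with torch.no_grad():
        cond = encoder(images)                 # dense semantic prompt
    recon = umm(cond, torch.randn_like(images))
    loss = F.mse_loss(recon, images)           # toy reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(reca_step(torch.randn(2, 3, 32, 32)))    # one toy update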

Quick Start & Requirements

  • Online Demo: Available on Hugging Face Spaces.
  • ComfyUI: Integration instructions provided; requires replacing BAGEL weights with RecA-tuned versions. Supports NF4/INT8.
  • Local Inference: Follow the BAGEL Installation Guide and run BAGEL/inference.ipynb; a weight-download sketch follows this list.
  • Full Training/Evaluation: Refer to external BAGEL and Harmon Installation Guides.
  • Prerequisites: Training requires significant hardware (the authors cite "6 × 80GB A100s"). Dependencies are detailed in the linked guides.
  • Key Links: Paper (arxiv.org/pdf/2509.07295), Project Page (reconstruction-alignment.github.io/), HF Models (huggingface.co/collections/sanaka87/realign-68ad2176380355a3dcedc068), HF Demo (huggingface.co/spaces/sanaka87/BAGEL-RecA).
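
To pull the RecA-tuned checkpoints programmatically rather than through the demo, the standard `huggingface_hub` call suffices. The repo id below is illustrative only; substitute an actual model from the linked collection.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- pick a model from the linked
# sanaka87 "realign" collection on Hugging Face.
local_dir = snapshot_download(repo_id="sanaka87/BAGEL-RecA")
print("weights downloaded to:", local_dir)
```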

Highlighted Details

  • Achieves state-of-the-art GenEval (0.86) and DPGBench (87.21) with 1.5B Harmon-RecA, outperforming larger models.
  • Significantly boosts BAGEL's image editing performance.
  • Further fine-tuning with GPT-4o-Image distillation data improves scores to 0.90 (GenEval) and 88.15 (DPGBench).
  • Offers quantized versions (INT8, NF4, DF11) for efficiency; see the loading sketch after this list.
  • Demonstrates superior image editing compared to Icedit, FLUX-Kontext, and GPT-4o.
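
For the quantized variants, a typical bitsandbytes NF4 configuration looks like the following. This is a generic loading pattern, not the repo's documented path: it assumes the checkpoints are `transformers`-compatible, and the model id is a placeholder.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Generic 4-bit NF4 loading pattern (bitsandbytes); whether the RecA
# checkpoints load through transformers this way is an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModel.from_pretrained(
    "sanaka87/BAGEL-RecA-NF4",      # placeholder id, not a real repo
    quantization_config=bnb_config,
    device_map="auto",
)
```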

Maintenance & Community

Updates through September 2025 indicate active development. Contact is via email (sanaka@berkeley.edu, xdwang@eecs.berkeley.edu), with GitHub issues recommended for implementation questions. No dedicated community channels or roadmap links are provided.

Licensing & Compatibility

Licensing is mixed, with the majority under the Apache License: the BAGEL and Show-o components are Apache-licensed, while Harmon and OpenUni use the S-Lab license. Users must comply with both sets of terms, particularly the S-Lab license's potential restrictions on commercial use.

Limitations & Caveats

Training code for Show-o and OpenUni architectures is pending release. Future work includes scaling BAGEL training and supporting new UMM architectures like Show-o2. The S-Lab license terms for Harmon/OpenUni require further investigation for commercial applications.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 29 stars in the last 30 days
