reconstruction-alignment by HorizonWind2004

Self-supervised learning for enhanced unified multimodal models

Created 1 month ago
297 stars

Top 89.3% on SourcePulse

Project Summary

This repository implements "Reconstruction Alignment" (RecA), a self-supervised post-training technique that unlocks the zero-shot potential of Unified Multimodal Models (UMMs). RecA substantially improves both image generation benchmarks and image editing capability, targeting researchers and engineers who want to get more out of UMMs without additional labeled data.

How It Works

RecA post-trains a UMM with a self-supervised objective: the model's own visual-understanding encoder embeds an input image into dense semantic embeddings, those embeddings stand in for the text prompt, and the model is optimized to reconstruct the original image. Because the supervision comes from the image itself, no captions are required. Applied to architectures such as BAGEL, Harmon, Show-o, and OpenUni, this consistently yields substantial improvements, often letting smaller models surpass larger ones on zero-shot benchmarks. A toy sketch of the idea follows.
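
The recipe is simple enough to sketch in a few lines. The PyTorch loop below is a minimal, hypothetical illustration, not the authors' code: `DummyEncoder`, `DummyUMM`, and the plain MSE loss are stand-ins (the real objective depends on the host architecture's generation loss).

```python
import torch
import torch.nn.functional as F

class DummyEncoder(torch.nn.Module):
    """Stand-in for the UMM's visual-understanding encoder."""
    def forward(self, images):                 # (B, 3, H, W) -> (B, T, D)
        return torch.randn(images.shape[0], 16, 64)

class DummyUMM(torch.nn.Module):
    """Stand-in for the unified model's image-generation branch."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 3)     # dense prompt -> per-channel bias

    def forward(self, cond, noise):            # cond: (B, T, D)
        bias = self.proj(cond.mean(dim=1))     # (B, 3)
        return torch.tanh(noise + bias[:, :, None, None])

encoder, umm = DummyEncoder(), DummyUMM()
opt = torch.optim.AdamW(umm.parameters(), lr=1e-5)

def reca_step(images):
    """One self-supervised RecA update: the image's own understanding
    embeddings serve as a dense 'caption', and the model is trained to
    reconstruct the image from them -- no text labels needed."""
    with torch.no_grad():
        cond = encoder(images)                 # dense semantic prompt
    recon = umm(cond, torch.randn_like(images))
    loss = F.mse_loss(recon, images)           # toy reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(reca_step(torch.randn(2, 3, 32, 32)))    # one toy update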

Quick Start & Requirements

  • Online Demo: Available on Hugging Face Spaces.
  • ComfyUI: Integration instructions provided; requires replacing BAGEL weights with RecA-tuned versions. Supports NF4/INT8.
  • Local Inference: Follow the BAGEL Installation Guide and run BAGEL/inference.ipynb; a weight-download sketch follows this list.
  • Full Training/Evaluation: Refer to external BAGEL and Harmon Installation Guides.
  • Prerequisites: Training requires significant hardware (the authors cite "6 × 80GB A100s"). Dependencies are detailed in the linked guides.
  • Key Links: Paper (arxiv.org/pdf/2509.07295), Project Page (reconstruction-alignment.github.io/), HF Models (huggingface.co/collections/sanaka87/realign-68ad2176380355a3dcedc068), HF Demo (huggingface.co/spaces/sanaka87/BAGEL-RecA).
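
To pull the RecA-tuned checkpoints programmatically rather than through the demo, the standard `huggingface_hub` call suffices. The repo id below is illustrative only; substitute an actual model from the linked collection.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- pick a model from the linked
# sanaka87 "realign" collection on Hugging Face.
local_dir = snapshot_download(repo_id="sanaka87/BAGEL-RecA")
print("weights downloaded to:", local_dir)
```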

Highlighted Details

  • Achieves state-of-the-art GenEval (0.86) and DPGBench (87.21) with 1.5B Harmon-RecA, outperforming larger models.
  • Significantly boosts BAGEL's image editing performance.
  • Further fine-tuning with GPT-4o-Image distillation data improves scores to 0.90 (GenEval) and 88.15 (DPGBench).
  • Offers quantized versions (INT8, NF4, DF11) for efficiency; see the loading sketch after this list.
  • Demonstrates superior image editing compared to Icedit, FLUX-Kontext, and GPT-4o.
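
For the quantized variants, a typical bitsandbytes NF4 configuration looks like the following. This is a generic loading pattern, not the repo's documented path: it assumes the checkpoints are `transformers`-compatible, and the model id is a placeholder.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Generic 4-bit NF4 loading pattern (bitsandbytes); whether the RecA
# checkpoints load through transformers this way is an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModel.from_pretrained(
    "sanaka87/BAGEL-RecA-NF4",      # placeholder id, not a real repo
    quantization_config=bnb_config,
    device_map="auto",
)
```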

Maintenance & Community

Updates through September 2025 indicate active development. Contact is via email (sanaka@berkeley.edu, xdwang@eecs.berkeley.edu), with GitHub issues recommended for implementation questions. No dedicated community channels or roadmap links are provided.

Licensing & Compatibility

Licensing is mixed, with the majority under the Apache License: the BAGEL and Show-o components are Apache-licensed, while Harmon and OpenUni use the S-Lab license. Users must comply with both sets of terms, particularly the S-Lab license's potential restrictions on commercial use.

Limitations & Caveats

Training code for Show-o and OpenUni architectures is pending release. Future work includes scaling BAGEL training and supporting new UMM architectures like Show-o2. The S-Lab license terms for Harmon/OpenUni require further investigation for commercial applications.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 29 stars in the last 30 days
