Bagel by ByteDance-Seed

Unified multimodal foundation model

created 3 months ago
4,696 stars

Top 10.7% on sourcepulse

Project Summary

BAGEL is an open-source unified multimodal foundation model designed for both understanding and generation tasks. It aims to provide state-of-the-art performance across various benchmarks, including visual understanding, text-to-image generation, and image editing, targeting researchers and developers working with multimodal AI.

How It Works

BAGEL is a 7B active parameter (14B total) model trained on large-scale interleaved multimodal data. Its architecture supports advanced capabilities like free-form visual manipulation, multiview synthesis, and world navigation, positioning it as a "world-modeling" system beyond traditional image editing. The model offers fine-grained control over generation through parameters like cfg_text_scale, cfg_image_scale, and various cfg_renorm_type options for managing text and image guidance during the diffusion process.

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (python=3.10), then install the dependencies in two steps: pip install -r requirements.txt, followed by pip install flash_attn==2.5.8 --no-build-isolation.
  • Pretrained Checkpoint: Download via Hugging Face snapshot_download.
  • Inference: Run python app.py. Unquantized inference needs 32GB+ of VRAM; GPUs with 12-32GB can run the model with NF4 quantization.
  • Resources: At least 12GB of GPU VRAM, even with quantization.
  • Links: Official Website, Demo, Report.
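The checkpoint download and the VRAM tiers listed above can be sketched in a few lines of Python. This is a hedged illustration, not the project's own script: inference_tier and download_checkpoint are hypothetical helpers, the file patterns are an assumption about what the model repository ships, and only snapshot_download (from the huggingface_hub package) is a real library call.

```python
def inference_tier(vram_gb: float) -> str:
    """Hypothetical helper: map available GPU memory to the tiers listed above."""
    if vram_gb >= 32:
        return "full"          # unquantized inference
    if vram_gb >= 12:
        return "nf4"           # NF4-quantized inference
    return "insufficient"      # below the stated 12GB minimum


def download_checkpoint(repo_id: str, save_dir: str) -> str:
    """Fetch a checkpoint snapshot into save_dir and return its local path."""
    # Imported lazily so the tier check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=save_dir,
        # Assumed file types; adjust to whatever the model repo actually contains.
        allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
    )
```

A call would look like download_checkpoint("ByteDance-Seed/BAGEL-7B-MoT", "models/BAGEL-7B-MoT"). The repository ID here is an assumption, so confirm it against the project's README before relying on it.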

Highlighted Details

  • Outperforms top-tier open-source VLMs like Qwen2.5-VL and InternVL-2.5 on multimodal understanding benchmarks.
  • Text-to-image quality is competitive with specialist generators like SD3.
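  • Produces stronger qualitative image-editing results than leading open-source models, and extends to "world-modeling" tasks such as free-form visual manipulation, multiview synthesis, and world navigation.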
  • Achieves comparable performance to Gemini 2.0 on KRIS-Bench and RISEBench reasoning benchmarks.

Maintenance & Community

  • Active community contributions noted for Dockerfiles, Windows guidelines, quantization, and Gradio app integration.
  • Community support channels include Discord and issue tracking for reporting bad cases.
  • Links: Discord, GitHub Issues.

Licensing & Compatibility

  • Licensed under the Apache 2.0 license.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Edited images may come out blurry; the README suggests adjusting the CFG parameters when this happens.
  • Results on some benchmarks (e.g., MathVista, IntelligentBench) are either not reported or fall below leading proprietary models such as GPT-4o.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 28
  • Star history: 4,722 stars in the last 90 days
