Bagel by ByteDance-Seed

Unified multimodal foundation model

Created 8 months ago

5,549 stars

Top 9.0% on SourcePulse

1 Expert Loves This Project: jiamings (Chief Scientist at Luma AI)

Project Summary

BAGEL is an open-source unified multimodal foundation model designed for both understanding and generation tasks. It aims to provide state-of-the-art performance across various benchmarks, including visual understanding, text-to-image generation, and image editing, targeting researchers and developers working with multimodal AI.

How It Works

BAGEL has 7B active parameters (14B total) and is trained on large-scale interleaved multimodal data. Its architecture supports advanced capabilities such as free-form visual manipulation, multiview synthesis, and world navigation, positioning it as a "world-modeling" system that goes beyond traditional image editing. Generation can be tuned through cfg_text_scale and cfg_image_scale, which set the strength of text and image guidance, and through the cfg_renorm_type options, which select how the guided output is renormalized during the diffusion process.
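To make the role of the guidance parameters more concrete, here is a minimal sketch of classifier-free guidance with two conditions followed by a norm-based rescale. It is a generic illustration, not BAGEL's implementation: the function names (guide, l2_norm, dual_cfg), the renorm_min argument, the order in which text and image guidance are applied, and the choice of a single global norm are all assumptions made for this example.

```python
import math


def guide(cond, uncond, scale):
    """Classifier-free guidance: start at the unconditional prediction
    and move toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def dual_cfg(v_full, v_no_text, v_no_image,
             cfg_text_scale, cfg_image_scale, renorm_min=0.0):
    """Hypothetical two-stage guidance with a global renorm.

    v_full     : prediction with both text and image conditions
    v_no_text  : prediction with the text condition dropped
    v_no_image : prediction with the image condition dropped
    """
    # Stage 1: amplify the effect of the text condition.
    v_text = guide(v_full, v_no_text, cfg_text_scale)
    # Stage 2: amplify the effect of the image condition on top of stage 1.
    v_guided = guide(v_text, v_no_image, cfg_image_scale)
    # Renorm (assumed "global" variant): shrink the guided vector so its
    # norm does not exceed the norm of the fully conditioned prediction.
    ratio = l2_norm(v_full) / (l2_norm(v_guided) + 1e-8)
    ratio = min(1.0, max(renorm_min, ratio))
    return [x * ratio for x in v_guided]
```

With both scales set to 1.0 the guided output is essentially the fully conditioned prediction. Larger scales push the output further from the unconditional predictions, and the rescale step keeps its magnitude from growing beyond that of the conditioned prediction.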

Quick Start & Requirements

Installation: Clone the repository, create a conda environment (python=3.10), install the requirements (pip install -r requirements.txt), then install FlashAttention (pip install flash_attn==2.5.8 --no-build-isolation).
Pretrained Checkpoint: Download via Hugging Face snapshot_download.
Inference: Run python app.py. Full-precision inference needs 32GB+ of VRAM; NF4 quantization brings the requirement down to the 12-32GB range.
Resources: A GPU with at least 12GB of VRAM is required for inference, even with quantization.
Links: Official Website, Demo, Report.
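The checkpoint download step can be sketched as follows. This is an illustrative wrapper around huggingface_hub.snapshot_download, not the project's own script: the helper names (build_download_kwargs, download_checkpoint), the cache sub-directory, and the file patterns are assumptions chosen for the example, and the repository ID must be supplied by the caller.

```python
from pathlib import Path


def build_download_kwargs(repo_id, save_dir):
    """Assemble keyword arguments for huggingface_hub.snapshot_download.

    The cache location and file patterns are illustrative defaults,
    not values taken from the project.
    """
    save_path = Path(save_dir)
    return {
        "repo_id": repo_id,
        "local_dir": str(save_path),
        "cache_dir": str(save_path / "cache"),
        # Restrict the download to weights, configs, and helper files.
        "allow_patterns": ["*.json", "*.safetensors", "*.bin",
                           "*.py", "*.md", "*.txt"],
    }


def download_checkpoint(repo_id, save_dir):
    """Download a model snapshot and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**build_download_kwargs(repo_id, save_dir))
```

A call might look like download_checkpoint("ByteDance-Seed/BAGEL-7B-MoT", "models/BAGEL-7B-MoT"), where the repository ID is assumed here and should be checked against the project's Hugging Face page. The call requires the huggingface_hub package and network access.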

Highlighted Details

Outperforms top-tier open-source VLMs like Qwen2.5-VL and InternVL-2.5 on multimodal understanding benchmarks.
Text-to-image quality is competitive with specialist generators like SD3.
Demonstrates stronger qualitative results in image editing than leading open-source models and extends to "world-modeling" tasks.
Achieves comparable performance to Gemini 2.0 on KRIS-Bench and RISEBench reasoning benchmarks.

Maintenance & Community

Active community contributions noted for Dockerfiles, Windows guidelines, quantization, and Gradio app integration.
Community support channels include Discord and issue tracking for reporting bad cases.
Links: Discord, GitHub Issues.

Licensing & Compatibility

Licensed under the Apache 2.0 license.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that edited images can come out blurry and suggests adjusting the CFG parameters to compensate. Results on some benchmarks (e.g., MathVista, IntelligentBench) are either not reported or fall below those of leading proprietary models such as GPT-4o.

Health Check

Last Commit: 2 months ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 4

Star History

118 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI) and Yaowei Zheng (Author of LLaMA-Factory).

DreamLLM by RunpeiDong

Multimodal LLM framework for comprehension and creation

Created 2 years ago

Updated 1 year ago

ShareGPT-4o-Image by FreedomIntelligence

Dataset and model for GPT-4o-level image generation

Created 6 months ago

Updated 5 months ago

Awesome-Multimodal-LLM by HenryHZY

Collection of research trends in LLM-guided multimodal learning

Created 2 years ago

Updated 2 years ago

Starred by Jiaming Song (Chief Scientist at Luma AI).

Lumina-mGPT by Alpha-VLLM

Multimodal autoregressive model for vision and language tasks

Created 1 year ago

Updated 2 months ago

Awesome-Unified-Multimodal-Models by AIDC-AI

Curated list of unified multimodal models, papers, and datasets

Created 8 months ago

Updated 4 months ago

Starred by Jeffrey Morgan (Cofounder of Ollama).

Liquid by FoundationVision

Multimodal generation research paper

Created 1 year ago

Updated 2 months ago

Awesome-Unified-Multimodal-Models by showlab

Paper list for unified multimodal models

Created 1 year ago

Updated 3 months ago

HunyuanImage-3.0 by Tencent-Hunyuan

Native multimodal model for advanced image generation

Created 3 months ago

Updated 2 months ago

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Thomas Wolf (Cofounder of Hugging Face), and 3 more.

InternLM-XComposer by InternLM

Multimodal model for long-context video/audio interactions, image understanding, and composition

Created 2 years ago

Updated 7 months ago

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI) and Lianmin Zheng (Coauthor of SGLang, vLLM).

LLaVA-NeXT by LLaVA-VL

Multimodal model for image, video, and 3D understanding

Created 1 year ago

Updated 3 months ago

Starred by Jesse Clark (Cofounder of Marqo) and Jiaming Song (Chief Scientist at Luma AI).

DeepSeek-VL by deepseek-ai

Vision-language model for real-world applications (research paper)

Created 1 year ago

Updated 1 year ago

Starred by Jiaming Song (Chief Scientist at Luma AI), Elvis Saravia (Founder of DAIR.AI), and 7 more.

Janus by deepseek-ai

Unified multimodal model research paper for understanding and generation

Created 1 year ago

Updated 11 months ago
