Unified multimodal foundation model
Top 10.7% on sourcepulse
BAGEL is an open-source unified multimodal foundation model designed for both understanding and generation tasks. It aims to provide state-of-the-art performance across various benchmarks, including visual understanding, text-to-image generation, and image editing, targeting researchers and developers working with multimodal AI.
How It Works
BAGEL is a 7B active parameter (14B total) model trained on large-scale interleaved multimodal data. Its architecture supports advanced capabilities like free-form visual manipulation, multiview synthesis, and world navigation, positioning it as a "world-modeling" system beyond traditional image editing. The model offers fine-grained control over generation through parameters like `cfg_text_scale`, `cfg_image_scale`, and the various `cfg_renorm_type` options for managing text and image guidance during the diffusion process.
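The interplay of these guidance scales can be illustrated with a small sketch of classifier-free guidance using separate text and image scales. This is an illustrative formulation only, not BAGEL's actual implementation; the `renorm` step mirrors the idea behind the `cfg_renorm_type` options rather than any specific one.

```python
import numpy as np

def guided_prediction(uncond, text_cond, image_cond,
                      cfg_text_scale=4.0, cfg_image_scale=1.5,
                      renorm=True):
    """Illustrative dual classifier-free guidance (not BAGEL's exact code).

    Each scale pushes the prediction away from the unconditional output
    and toward the corresponding conditional output.
    """
    guided = (uncond
              + cfg_text_scale * (text_cond - uncond)
              + cfg_image_scale * (image_cond - uncond))
    if renorm:
        # Rescale so the guided prediction keeps the conditional norm,
        # limiting over-saturation at high guidance scales.
        guided = guided * np.linalg.norm(text_cond) / (np.linalg.norm(guided) + 1e-8)
    return guided
```

With `cfg_text_scale=1.0` and `cfg_image_scale=0.0`, the prediction reduces to the plain text-conditional output; raising either scale strengthens that modality's influence.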
Quick Start & Requirements
Create an environment with Python 3.10 (`python=3.10`), install the requirements (`pip install -r requirements.txt flash_attn==2.5.8 --no-build-isolation`), download the model weights with `snapshot_download`, and launch the demo with `python app.py`
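The setup steps above can be sketched as a shell session. The model repo id and local weight directory are assumptions for illustration; check the BAGEL README for the canonical values.

```shell
# Create and activate the environment (Python 3.10)
conda create -n bagel python=3.10 -y
conda activate bagel

# Install dependencies, including the pinned flash-attn build
pip install -r requirements.txt flash_attn==2.5.8 --no-build-isolation

# Download the weights via huggingface_hub (repo id is an assumption)
python -c "from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='ByteDance-Seed/BAGEL-7B-MoT', local_dir='models/BAGEL-7B-MoT')"

# Launch the demo
python app.py
```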
with options for VRAM usage: 32GB+ for the full model, or 12-32GB with NF4 quantization.

Highlighted Details
Maintenance & Community
Last activity was 1 month ago; the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats