Multimodal models for vision-language tasks
Bunny is a family of lightweight, high-performance multimodal models designed for efficient vision-language understanding. It offers flexibility by supporting various plug-and-play vision encoders (EVA-CLIP, SigLIP) and language backbones (Llama-3-8B, Phi-3-mini, etc.), making it suitable for researchers and developers seeking adaptable multimodal solutions.
How It Works
Bunny compensates for its smaller model sizes by curating more informative training data. It uses an S²-Wrapper (Scaling on Scales) for improved performance and support for high-resolution images (up to 1152×1152). The architecture allows flexible integration of different vision encoders and LLMs, enabling tailored performance characteristics.
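The multi-scale idea behind the S²-Wrapper can be illustrated with a short, self-contained sketch. This is not Bunny's actual implementation: the encoder is a stand-in nn.Module, and the crop-stitching and pooling details are simplified assumptions about how features from an upscaled image are merged back to the base grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyEncoder(nn.Module):
    """Stand-in for a frozen vision encoder (e.g. SigLIP) that maps
    a 3x384x384 image to a 24x24 grid of patch features."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, 384, 384)
        return self.proj(x)                    # (B, dim, 24, 24)

def s2_features(encoder, image, base=384, scales=(1, 2)):
    """Multi-scale ("Scaling on Scales") feature extraction sketch:
    run the encoder on the image at the base resolution and on
    non-overlapping crops of an upscaled copy, then concatenate the
    per-scale feature maps along the channel dimension."""
    feats = []
    for s in scales:
        resized = F.interpolate(image, size=(base * s, base * s),
                                mode='bilinear', align_corners=False)
        # Split the upscaled image into s*s crops of the base size.
        crops = [resized[..., i*base:(i+1)*base, j*base:(j+1)*base]
                 for i in range(s) for j in range(s)]
        crop_feats = [encoder(c) for c in crops]       # each (B, D, 24, 24)
        # Stitch crop features back into one spatial grid, then pool
        # the grid back down to the base 24x24 resolution.
        rows = [torch.cat(crop_feats[i*s:(i+1)*s], dim=-1) for i in range(s)]
        grid = torch.cat(rows, dim=-2)                 # (B, D, 24*s, 24*s)
        feats.append(F.adaptive_avg_pool2d(grid, 24))  # (B, D, 24, 24)
    return torch.cat(feats, dim=1)                     # (B, D*len(scales), 24, 24)

if __name__ == "__main__":
    enc = DummyEncoder()
    img = torch.randn(1, 3, 384, 384)
    print(s2_features(enc, img).shape)   # torch.Size([1, 128, 24, 24])
```

With a 384-pixel encoder, processing scales up to 3× in this fashion corresponds to the 1152×1152 effective input resolution mentioned above.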
Quick Start & Requirements
pip install torch transformers accelerate pillow
(or use the provided Docker image).
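For reference, a minimal inference sketch in the style of the upstream Hugging Face quickstart is shown below. The checkpoint name (BAAI/Bunny-v1_1-Llama-3-8B-V), the <image> placeholder handling with the -200 image-token id, and the model.process_images helper come from Bunny's custom remote code; verify them against the repository before relying on them.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name and helpers below follow Bunny's remote code; adjust as needed.
model_id = 'BAAI/Bunny-v1_1-Llama-3-8B-V'
device = 'cuda'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt; the <image> placeholder is replaced by the special
# image token id (-200) expected by the LLaVA-style remote code.
prompt = 'Why is the image funny?'
text = ("A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's "
        f"questions. USER: <image>\n{prompt} ASSISTANT:")
chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:],
                         dtype=torch.long).unsqueeze(0).to(device)

# Preprocess the image with the model's own processor (provided by remote code).
image = Image.open('example.png')
image_tensor = model.process_images([image], model.config).to(
    dtype=model.dtype, device=device)

output_ids = model.generate(input_ids,
                            images=image_tensor,
                            max_new_tokens=100,
                            use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:],
                       skip_special_tokens=True).strip())
```

Note that trust_remote_code=True is what pulls in the Bunny-specific modeling code from the checkpoint repository.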
Maintenance & Community
The project is actively updated; recent releases include new training strategies, data, and follow-up work such as SpatialBot and the MMR benchmark. Community interaction is facilitated via Hugging Face and ModelScope.
Licensing & Compatibility
The project code is licensed under Apache 2.0. However, it utilizes certain datasets and checkpoints that are subject to their original licenses, requiring users to comply with all terms.
Limitations & Caveats
The project relies on specific versions of dependencies for testing; compatibility with other versions is not guaranteed. Users must ensure compliance with the original licenses of included datasets and checkpoints.