Bunny by BAAI-DCAI

Multimodal models for vision-language tasks

created 1 year ago
1,023 stars

Top 37.2% on sourcepulse

View on GitHub
Project Summary

Bunny is a family of lightweight, high-performance multimodal models designed for efficient vision-language understanding. It offers flexibility by supporting various plug-and-play vision encoders (EVA-CLIP, SigLIP) and language backbones (Llama-3-8B, Phi-3-mini, etc.), making it suitable for researchers and developers seeking adaptable multimodal solutions.
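For example, released checkpoints load directly through Hugging Face transformers with trust_remote_code enabled. A minimal sketch, adapted from the pattern on the Bunny model cards; the prompt template, the -200 image-token id, and the process_images helper come from the checkpoint's bundled remote code and may differ between releases:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Bunny checkpoints ship their own modeling code, so trust_remote_code is required
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Bunny-Llama-3-8B-V",
    torch_dtype=torch.float16,  # use float32 on CPU
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/Bunny-Llama-3-8B-V", trust_remote_code=True
)

# Build a prompt with an <image> placeholder, then splice in the special
# image-token id (-200 in the model card's quickstart) between the chunks.
prompt = "Describe this image."
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    f"questions. USER: <image>\n{prompt} ASSISTANT:"
)
chunks = [tokenizer(c).input_ids for c in text.split("<image>")]
input_ids = torch.tensor(
    chunks[0] + [-200] + chunks[1][1:], dtype=torch.long  # [1:] drops duplicate BOS
).unsqueeze(0).to(model.device)

# Preprocess the image with the helper the remote code attaches to the model
image = Image.open("example.png")
image_tensor = model.process_images([image], model.config).to(
    dtype=model.dtype, device=model.device
)

output_ids = model.generate(
    input_ids, images=image_tensor, max_new_tokens=100, use_cache=True
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True))
```

Swapping the model id (e.g., for a Phi-3-based checkpoint) is the only change needed to try a different encoder/backbone pairing, which is the point of the plug-and-play design.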

How It Works

Bunny compensates for smaller model sizes by curating more informative training data. It uses an S²-Wrapper for multi-scale visual features and supports high-resolution images (up to 1152x1152). The architecture allows flexible integration of different vision encoders and LLMs, enabling tailored performance characteristics.
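Conceptually, the S²-Wrapper tiles an upscaled image into crops at the vision encoder's native resolution, encodes each crop, stitches the resulting feature maps back together, pools them to the base grid, and concatenates them channel-wise with the base-scale features. A minimal illustrative sketch of the idea, not Bunny's actual implementation (encode, base_size, and scales are placeholders):

```python
import torch
import torch.nn.functional as F

def s2_wrapper(encode, image, base_size=384, scales=(1, 2)):
    """S^2-style multi-scale feature extraction (illustrative sketch).

    encode: maps a (B, 3, base_size, base_size) batch to (B, N, C) patch
    features, where N = n * n is a square patch grid.
    """
    B = image.shape[0]
    feats = []
    for s in scales:
        size = base_size * s
        img = F.interpolate(image, size=(size, size),
                            mode="bilinear", align_corners=False)
        # Tile the upscaled image into s*s crops at the encoder's resolution
        tiles = [img[..., i * base_size:(i + 1) * base_size,
                     j * base_size:(j + 1) * base_size]
                 for i in range(s) for j in range(s)]
        out = encode(torch.cat(tiles, dim=0))         # (s*s*B, N, C)
        n = int(out.shape[1] ** 0.5)
        out = out.view(s, s, B, n, n, -1)
        # Stitch the tile grids into one (s*n, s*n) feature map per image
        out = out.permute(2, 5, 0, 3, 1, 4).reshape(B, -1, s * n, s * n)
        # Pool back to the base n x n grid so scales align token-for-token
        out = F.adaptive_avg_pool2d(out, n).flatten(2).transpose(1, 2)
        feats.append(out)                             # (B, n*n, C)
    # Channel-wise concatenation keeps the token count fixed while widening
    # each token with higher-resolution detail
    return torch.cat(feats, dim=-1)                   # (B, n*n, C*len(scales))
```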

Quick Start & Requirements

  • Installation: pip install torch transformers accelerate pillow (or use the provided Docker image).
  • Prerequisites: CUDA 11.8/12, cuDNN 8.7.0 recommended. PyTorch, Transformers, Accelerate, Pillow, and optionally Apex and Flash-Attention for optimized performance.
  • Resources: Requires a GPU with sufficient memory; training was conducted on 8 A100 GPUs. (A quick environment check is sketched after this list.)
  • Demos & Docs: Hugging Face, ModelScope, Technical Report, Demo.
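Before downloading weights, it helps to confirm the GPU, the CUDA build PyTorch was compiled against, and dtype support (a generic PyTorch sketch, nothing Bunny-specific):

```python
import torch

print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
    # bf16 needs Ampere or newer (compute capability >= 8.0); otherwise use fp16
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```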

Highlighted Details

  • Bunny-Llama-3-8B-V is the first vision-language model based on Llama-3.
  • Bunny-4B (SigLIP + Phi-3-mini) outperforms not only MLLMs of similar size but also larger 7B and 13B models.
  • Supports high-resolution images up to 1152x1152 in v1.1 models.
  • Offers both merged weights and LoRA weights for flexibility.
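To use the LoRA releases as standalone checkpoints, the low-rank weights must be merged into the base model. A generic PEFT-based sketch of that operation (all paths are placeholders; Bunny's own merge tooling may differ):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base checkpoint, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(
    "path/to/base-model", torch_dtype=torch.float16, trust_remote_code=True
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the low-rank updates into the base weights and drop the PEFT wrapper
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-output")
```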

Maintenance & Community

The project is actively maintained, with releases spanning training strategies, data, and related efforts such as SpatialBot and MMR. Community interaction is facilitated via Hugging Face and ModelScope.

Licensing & Compatibility

The project code is licensed under Apache 2.0. However, some of the datasets and checkpoints it uses are subject to their original licenses, and users must comply with all applicable terms.

Limitations & Caveats

The project relies on specific versions of dependencies for testing; compatibility with other versions is not guaranteed. Users must ensure compliance with the original licenses of included datasets and checkpoints.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

Top 0.2%, 23k stars, created 2 years ago, updated 11 months ago