Multimodal models for vision-language tasks
Bunny is a family of lightweight, high-performance multimodal models designed for efficient vision-language understanding. It offers flexibility by supporting various plug-and-play vision encoders (EVA-CLIP, SigLIP) and language backbones (Llama-3-8B, Phi-3-mini, etc.), making it suitable for researchers and developers seeking adaptable multimodal solutions.
How It Works
Bunny compensates for its smaller model sizes by curating more informative training data. It uses an S²-Wrapper (Scaling on Scales) for improved performance and support for high-resolution images (up to 1152×1152). The architecture allows flexible integration of different vision encoders and LLMs, enabling tailored performance characteristics.
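The multi-scale idea behind the S²-Wrapper can be illustrated with a short, self-contained sketch. This is not Bunny's actual implementation: the encoder is a stand-in nn.Module, and the crop-stitching and pooling details are simplified assumptions about how features from an upscaled image are merged back to the base grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyEncoder(nn.Module):
    """Stand-in for a frozen vision encoder (e.g. SigLIP) that maps
    a 3x384x384 image to a 24x24 grid of patch features."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, 384, 384)
        return self.proj(x)                    # (B, dim, 24, 24)

def s2_features(encoder, image, base=384, scales=(1, 2)):
    """Multi-scale ("Scaling on Scales") feature extraction sketch:
    run the encoder on the image at the base resolution and on
    non-overlapping crops of an upscaled copy, then concatenate the
    per-scale feature maps along the channel dimension."""
    feats = []
    for s in scales:
        resized = F.interpolate(image, size=(base * s, base * s),
                                mode='bilinear', align_corners=False)
        # Split the upscaled image into s*s crops of the base size.
        crops = [resized[..., i*base:(i+1)*base, j*base:(j+1)*base]
                 for i in range(s) for j in range(s)]
        crop_feats = [encoder(c) for c in crops]       # each (B, D, 24, 24)
        # Stitch crop features back into one spatial grid, then pool
        # the grid back down to the base 24x24 resolution.
        rows = [torch.cat(crop_feats[i*s:(i+1)*s], dim=-1) for i in range(s)]
        grid = torch.cat(rows, dim=-2)                 # (B, D, 24*s, 24*s)
        feats.append(F.adaptive_avg_pool2d(grid, 24))  # (B, D, 24, 24)
    return torch.cat(feats, dim=1)                     # (B, D*len(scales), 24, 24)

if __name__ == "__main__":
    enc = DummyEncoder()
    img = torch.randn(1, 3, 384, 384)
    print(s2_features(enc, img).shape)   # torch.Size([1, 128, 24, 24])
```

With a 384-pixel encoder, processing scales up to 3× in this fashion corresponds to the 1152×1152 effective input resolution mentioned above.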
Quick Start & Requirements
pip install torch transformers accelerate pillow
(or use the provided Docker image).
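For reference, a minimal inference sketch in the style of the upstream Hugging Face quickstart is shown below. The checkpoint name (BAAI/Bunny-v1_1-Llama-3-8B-V), the <image> placeholder handling with the -200 image-token id, and the model.process_images helper come from Bunny's custom remote code; verify them against the repository before relying on them.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name and helpers below follow Bunny's remote code; adjust as needed.
model_id = 'BAAI/Bunny-v1_1-Llama-3-8B-V'
device = 'cuda'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt; the <image> placeholder is replaced by the special
# image token id (-200) expected by the LLaVA-style remote code.
prompt = 'Why is the image funny?'
text = ("A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's "
        f"questions. USER: <image>\n{prompt} ASSISTANT:")
chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:],
                         dtype=torch.long).unsqueeze(0).to(device)

# Preprocess the image with the model's own processor (provided by remote code).
image = Image.open('example.png')
image_tensor = model.process_images([image], model.config).to(
    dtype=model.dtype, device=device)

output_ids = model.generate(input_ids,
                            images=image_tensor,
                            max_new_tokens=100,
                            use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:],
                       skip_special_tokens=True).strip())
```

Note that trust_remote_code=True is what pulls in the Bunny-specific modeling code from the checkpoint repository.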
Maintenance & Community
The project is actively updated; recent releases include new training strategies, data, and follow-up work such as SpatialBot and the MMR benchmark. Community interaction is facilitated via Hugging Face and ModelScope.
Licensing & Compatibility
The project code is licensed under Apache 2.0. However, it utilizes certain datasets and checkpoints that are subject to their original licenses, requiring users to comply with all terms.
Limitations & Caveats
The project relies on specific versions of dependencies for testing; compatibility with other versions is not guaranteed. Users must ensure compliance with the original licenses of included datasets and checkpoints.