honeybee by khanrc

PyTorch code for multimodal LLM research paper (CVPR 2024 highlight)

Created 1 year ago · 460 stars · Top 65.8% on SourcePulse

Project Summary

Honeybee is the official PyTorch implementation of a locality-enhanced projector for multimodal large language models (MLLMs), presented as a CVPR 2024 Highlight paper. The projector is the module that bridges the vision encoder and the language model; Honeybee redesigns it to preserve fine-grained spatial information in visual features while keeping the visual token count manageable. The project targets researchers and developers working with MLLMs.

How It Works

Honeybee introduces a "locality-enhanced projector" designed to better capture fine-grained relationships between visual regions and textual tokens. Existing projectors face a trade-off: a simple linear projection preserves the spatial layout of visual features but forwards every visual token to the LLM, while resampler-style abstractors reduce the token count but treat visual features globally, discarding local spatial structure. Honeybee's abstractors compress the visual token grid with operations that mix only neighboring tokens, retaining locality at a reduced token count (the "C-Abs" in the released checkpoint names refers to the convolutional variant). The specific architectural details and algorithms are elaborated in the linked paper.
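
As a rough illustration of the convolutional idea, here is a minimal, unofficial sketch; the layer choices, dimensions, and the 12×12 output grid are assumptions for the example, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalityEnhancedProjector(nn.Module):
    """Illustrative sketch: compress a grid of visual tokens with
    convolutions (which mix only neighboring tokens) instead of a
    global resampler, then project to the LLM embedding size."""

    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=12):
        super().__init__()
        # Depthwise + pointwise convs mix information locally across the grid.
        self.local_mix = nn.Sequential(
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1, groups=vis_dim),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=1),
            nn.GELU(),
        )
        # Downsample to an out_grid x out_grid token grid (12x12 = 144 tokens,
        # echoing the "M144" suffix in the released checkpoint names).
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (B, N, C) from a ViT; N is assumed to be a square grid.
        b, n, c = vis_tokens.shape
        h = w = int(n ** 0.5)
        x = vis_tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
        x = self.pool(self.local_mix(x))                    # (B, C, g, g)
        x = x.flatten(2).transpose(1, 2)                    # (B, g*g, C)
        return self.proj(x)                                 # (B, g*g, llm_dim)

if __name__ == "__main__":
    vit_tokens = torch.randn(2, 576, 1024)  # e.g., a 24x24 ViT feature grid
    out = LocalityEnhancedProjector()(vit_tokens)
    print(out.shape)  # torch.Size([2, 144, 4096])
```

Contrast this with a resampler-style abstractor, where every output token attends to the whole feature map and the grid structure is lost.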

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch 2.0.1. The demo needs additional packages: pip install -r requirements_demo.txt.
  • Data: Requires downloading and organizing datasets (e.g., COYO100M, VQAv2, LLaVA150K) and configuring paths in configs/data_configs/train_dataset and configs/tasks.
  • Resources: Pretraining involves large datasets (COYO-700M subset). Strict reproduction of official results requires 8 GPUs.
  • Docs: paper, inference_example.ipynb, and a gradio demo (launch with python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last)
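
Once the web server is running, the Gradio app can also be driven programmatically. Below is a minimal sketch using gradio_client; the port, endpoint, and argument order are assumptions, so check the live API first:

```python
# pip install gradio_client
from gradio_client import Client

client = Client("http://localhost:7860")  # match the --port given to serve.web_server
print(client.view_api())  # inspect the demo's actual endpoints and argument names

# Hypothetical call; adapt the arguments to what view_api() reports.
result = client.predict("path/to/image.jpg", "Describe this image in detail.")
print(result)
```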

Highlighted Details

  • Achieved CVPR 2024 Highlight status.
  • Provides checkpoints for both pre-training (PT) and fine-tuning (FT) stages for 7B and 13B parameter models.
  • Demonstrates strong performance across various benchmarks including MMB, MME, SEED-I, LLaVA-w, MM-Vet, MMMU, and POPE.
  • Supports instruction tuning with various dataset combinations and configurable sampling weights.
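
To illustrate the sampling-weight idea, here is a generic PyTorch sketch of weighted dataset mixing; Honeybee's actual mechanism is driven by its YAML configs, and the datasets and 2:1 ratio below are placeholders:

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-ins for two instruction-tuning datasets (e.g., LLaVA150K and VQAv2).
ds_a = TensorDataset(torch.zeros(1000, 4))
ds_b = TensorDataset(torch.ones(500, 4))
mixture = ConcatDataset([ds_a, ds_b])

# Per-sample weights so ds_a is drawn twice as often as ds_b overall,
# independent of the datasets' sizes (hypothetical 2:1 mixing ratio).
weights = torch.tensor([2.0 / len(ds_a)] * len(ds_a) +
                       [1.0 / len(ds_b)] * len(ds_b))

sampler = WeightedRandomSampler(weights, num_samples=len(mixture), replacement=True)
loader = DataLoader(mixture, batch_size=32, sampler=sampler)
```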

Maintenance & Community

Honeybee is the official implementation of the paper, but development appears to have wound down: the last commit was about a year ago (see Health Check below). Links to community channels are not provided in the README.

Licensing & Compatibility

  • Source Code: Apache 2.0 License.
  • Pretrained Weights: CC-BY-NC 4.0 License.
  • Compatibility: The non-commercial weight license blocks commercial use even though the code is Apache 2.0. Built on mPLUG-Owl (Apache 2.0).

Limitations & Caveats

The CC-BY-NC 4.0 license on the pretrained weights prohibits commercial applications, and strict reproduction of the official results requires the 8-GPU training setup.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

Multimodal framework for vision-and-language transformer research
0% · 373 stars · Created 3 years ago · Updated 2 years ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

LLM research codebase for training and inference
0.1% · 5k stars · Created 11 months ago · Updated 2 months ago