honeybee by khanrc

PyTorch code for multimodal LLM research paper (CVPR 2024 highlight)

created 1 year ago
453 stars

Top 67.6% on sourcepulse

Project Summary

Honeybee is the official PyTorch implementation of a locality-enhanced projector for multimodal large language models (MLLMs), presented at CVPR 2024 as a Highlight paper. It aims to improve multimodal understanding and generation by strengthening the interaction between visual and textual modalities, and targets researchers and developers working with MLLMs.

How It Works

Honeybee introduces a "locality-enhanced projector" designed to better capture fine-grained relationships between visual regions and textual tokens, addressing a limitation of existing projectors that treat visual features globally and discard spatial structure. The specific architectural details and algorithms are elaborated in the linked paper.
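As an illustration of the general idea (a minimal sketch, not the paper's exact architecture), a locality-preserving projector can mix neighboring visual tokens with convolutions and reduce the token count with adaptive pooling before projecting into the LLM embedding space. All layer sizes below are hypothetical:

```python
import torch
import torch.nn as nn

class LocalityProjector(nn.Module):
    """Hedged sketch of a locality-enhanced projector: convolutions
    mix nearby visual tokens (preserving spatial locality), and
    adaptive pooling reduces the number of tokens fed to the LLM."""

    def __init__(self, in_dim=1024, out_dim=4096, num_queries=144):
        super().__init__()
        side = int(num_queries ** 0.5)            # e.g. 144 -> 12x12 grid
        self.conv1 = nn.Conv2d(in_dim, in_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(side)    # downsample token grid
        self.conv2 = nn.Conv2d(in_dim, in_dim, kernel_size=3, padding=1)
        self.proj = nn.Linear(in_dim, out_dim)    # map to LLM embed dim

    def forward(self, x):
        # x: (B, N, C) visual tokens laid out on a square grid
        b, n, c = x.shape
        h = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, h)   # back to a 2D grid
        x = self.conv2(self.pool(self.conv1(x)))    # local mixing + pooling
        x = x.flatten(2).transpose(1, 2)            # (B, num_queries, C)
        return self.proj(x)

proj = LocalityProjector(in_dim=1024, out_dim=4096, num_queries=144)
tokens = torch.randn(2, 576, 1024)   # e.g. 24x24 ViT patch features
out = proj(tokens)
print(out.shape)  # torch.Size([2, 144, 4096])
```

The key contrast with a global abstractor (e.g. a Q-Former-style resampler) is that pooling here happens over spatially adjacent tokens, so local structure survives the token reduction.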

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch 2.0.1. Additional requirements for demo: pip install -r requirements_demo.txt.
  • Data: Requires downloading and organizing datasets (e.g., COYO100M, VQAv2, LLaVA150K) and configuring paths in configs/data_configs/train_dataset and configs/tasks.
  • Resources: Pretraining involves large datasets (COYO-700M subset). Strict reproduction of official results requires 8 GPUs.
  • Docs: paper, inference_example.ipynb, and a gradio demo (launch with python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last)

Highlighted Details

  • Achieved CVPR 2024 Highlight status.
  • Provides checkpoints for both pre-training (PT) and fine-tuning (FT) stages for 7B and 13B parameter models.
  • Demonstrates strong performance across various benchmarks including MMB, MME, SEED-I, LLaVA-w, MM-Vet, MMMU, and POPE.
  • Supports instruction tuning with various dataset combinations and configurable sampling weights.
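The configurable sampling weights mentioned above can be sketched as simple weighted draws over dataset names. The dataset names and weights below are hypothetical placeholders, not the repo's actual configuration:

```python
import random

random.seed(0)

# Hypothetical per-dataset sampling weights for an instruction-tuning
# mixture; the real weights live in the repo's config files.
dataset_weights = {
    "llava150k": 3.0,
    "vqav2": 2.0,
    "coyo_subset": 1.0,
}

names = list(dataset_weights)
weights = list(dataset_weights.values())

def sample_batch_sources(batch_size):
    """Pick which dataset each example in a batch is drawn from,
    proportionally to the configured weights."""
    return random.choices(names, weights=weights, k=batch_size)

batch = sample_batch_sources(8)
print(batch)  # e.g. a mix dominated by the higher-weighted datasets
```

In practice a training loop would then fetch one example from each selected dataset, so the mixture ratio is controlled purely by the weights.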

Maintenance & Community

The project is the official implementation by the paper's authors, though the last commit was about a year ago. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

  • Source Code: Apache 2.0 License.
  • Pretrained Weights: CC-BY-NC 4.0 License.
  • Compatibility: The non-commercial license for weights restricts commercial use. Developed based on mPLUG-Owl (Apache 2.0).

Limitations & Caveats

The CC-BY-NC 4.0 license on pretrained weights prohibits commercial applications. Strict reproduction of results requires specific hardware configurations (8 GPUs).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack (Top 0.4% · 258 stars)
Efficiently train foundation models with PyTorch
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

open-r1 by huggingface (Top 0.2% · 25k stars)
SDK for reproducing DeepSeek-R1
created 6 months ago · updated 3 days ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai (Top 0.2% · 40k stars)
Deep learning optimization library for distributed training and inference
created 5 years ago · updated 1 day ago