honeybee by khanrc

PyTorch code for multimodal LLM research paper (CVPR 2024 highlight)

Created 1 year ago · 460 stars · Top 65.8% on SourcePulse

Project Summary

Honeybee is the official PyTorch implementation of a locality-enhanced projector for multimodal large language models (MLLMs), presented as a CVPR 2024 Highlight paper. The projector is the module that bridges the vision encoder and the language model; Honeybee redesigns it to preserve fine-grained spatial information in visual features while keeping the visual token count manageable. The project targets researchers and developers working with MLLMs.

How It Works

Honeybee introduces a "locality-enhanced projector" designed to better capture fine-grained relationships between visual regions and textual tokens. Existing projectors face a trade-off: a simple linear projection preserves the spatial layout of visual features but forwards every visual token to the LLM, while resampler-style abstractors reduce the token count but treat visual features globally, discarding local spatial structure. Honeybee's abstractors compress the visual token grid with operations that mix only neighboring tokens, retaining locality at a reduced token count (the "C-Abs" in the released checkpoint names refers to the convolutional variant). The specific architectural details and algorithms are elaborated in the linked paper.
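
As a rough illustration of the convolutional idea, here is a minimal, unofficial sketch; the layer choices, dimensions, and the 12×12 output grid are assumptions for the example, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalityEnhancedProjector(nn.Module):
    """Illustrative sketch: compress a grid of visual tokens with
    convolutions (which mix only neighboring tokens) instead of a
    global resampler, then project to the LLM embedding size."""

    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=12):
        super().__init__()
        # Depthwise + pointwise convs mix information locally across the grid.
        self.local_mix = nn.Sequential(
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1, groups=vis_dim),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=1),
            nn.GELU(),
        )
        # Downsample to an out_grid x out_grid token grid (12x12 = 144 tokens,
        # echoing the "M144" suffix in the released checkpoint names).
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (B, N, C) from a ViT; N is assumed to be a square grid.
        b, n, c = vis_tokens.shape
        h = w = int(n ** 0.5)
        x = vis_tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
        x = self.pool(self.local_mix(x))                    # (B, C, g, g)
        x = x.flatten(2).transpose(1, 2)                    # (B, g*g, C)
        return self.proj(x)                                 # (B, g*g, llm_dim)

if __name__ == "__main__":
    vit_tokens = torch.randn(2, 576, 1024)  # e.g., a 24x24 ViT feature grid
    out = LocalityEnhancedProjector()(vit_tokens)
    print(out.shape)  # torch.Size([2, 144, 4096])
```

Contrast this with a resampler-style abstractor, where every output token attends to the whole feature map and the grid structure is lost.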

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch 2.0.1. The demo needs additional packages: pip install -r requirements_demo.txt.
  • Data: Requires downloading and organizing datasets (e.g., COYO100M, VQAv2, LLaVA150K) and configuring paths in configs/data_configs/train_dataset and configs/tasks.
  • Resources: Pretraining involves large datasets (COYO-700M subset). Strict reproduction of official results requires 8 GPUs.
  • Docs: paper, inference_example.ipynb, and a gradio demo (launch with python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last)
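
Once the web server is running, the Gradio app can also be driven programmatically. Below is a minimal sketch using gradio_client; the port, endpoint, and argument order are assumptions, so check the live API first:

```python
# pip install gradio_client
from gradio_client import Client

client = Client("http://localhost:7860")  # match the --port given to serve.web_server
print(client.view_api())  # inspect the demo's actual endpoints and argument names

# Hypothetical call; adapt the arguments to what view_api() reports.
result = client.predict("path/to/image.jpg", "Describe this image in detail.")
print(result)
```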

Highlighted Details

  • Achieved CVPR 2024 Highlight status.
  • Provides checkpoints for both pre-training (PT) and fine-tuning (FT) stages for 7B and 13B parameter models.
  • Demonstrates strong performance across various benchmarks including MMB, MME, SEED-I, LLaVA-w, MM-Vet, MMMU, and POPE.
  • Supports instruction tuning with various dataset combinations and configurable sampling weights.
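
To illustrate the sampling-weight idea, here is a generic PyTorch sketch of weighted dataset mixing; Honeybee's actual mechanism is driven by its YAML configs, and the datasets and 2:1 ratio below are placeholders:

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-ins for two instruction-tuning datasets (e.g., LLaVA150K and VQAv2).
ds_a = TensorDataset(torch.zeros(1000, 4))
ds_b = TensorDataset(torch.ones(500, 4))
mixture = ConcatDataset([ds_a, ds_b])

# Per-sample weights so ds_a is drawn twice as often as ds_b overall,
# independent of the datasets' sizes (hypothetical 2:1 mixing ratio).
weights = torch.tensor([2.0 / len(ds_a)] * len(ds_a) +
                       [1.0 / len(ds_b)] * len(ds_b))

sampler = WeightedRandomSampler(weights, num_samples=len(mixture), replacement=True)
loader = DataLoader(mixture, batch_size=32, sampler=sampler)
```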

Maintenance & Community

Honeybee is the official implementation of the paper, but development appears to have wound down: the last commit was about a year ago (see Health Check below). Links to community channels are not provided in the README.

Licensing & Compatibility

  • Source Code: Apache 2.0 License.
  • Pretrained Weights: CC-BY-NC 4.0 License.
  • Compatibility: The non-commercial weight license blocks commercial use even though the code is Apache 2.0. Built on mPLUG-Owl (Apache 2.0).

Limitations & Caveats

The CC-BY-NC 4.0 license on the pretrained weights prohibits commercial applications, and strict reproduction of the official results requires the 8-GPU training setup.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

Multimodal framework for vision-and-language transformer research
0% · 373 stars · Created 3 years ago · Updated 2 years ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

LLM research codebase for training and inference
0.1% · 5k stars · Created 11 months ago · Updated 2 months ago