Multimodal LLM research paper using small backbones for efficiency
TinyGPT-V offers an efficient multimodal large language model (MLLM) solution by leveraging small backbone architectures. It targets researchers and developers seeking high-performance MLLMs with reduced computational requirements, achieving near state-of-the-art results with significantly smaller models.
How It Works
TinyGPT-V employs a novel approach that integrates a small, efficient language model (Phi-2) with a vision encoder. The training process is divided into stages, progressively enhancing multimodal capabilities. This strategy allows for efficient learning and adaptation, enabling the model to achieve strong performance on multimodal tasks while maintaining a low resource footprint.
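The composition can be pictured in a few lines of PyTorch. The sketch below is illustrative only: the class, layers, and dimensions are toy stand-ins (for instance, the real Phi-2 hidden size is 2560), not the project's code. It shows the general pattern of projecting features from a frozen vision encoder into the language model's embedding space and prepending them to the text tokens.

```python
# Minimal sketch of a TinyGPT-V-style composition: a frozen vision encoder,
# a trainable linear projection into the language model's embedding space,
# and a small language model. All names, layers, and dimensions are toy
# stand-ins for illustration, not the project's actual code.
import torch
import torch.nn as nn


class TinyMultimodalSketch(nn.Module):
    def __init__(self, vis_dim=192, llm_dim=256, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained ViT-style vision encoder, kept frozen.
        self.vision_encoder = nn.Sequential(nn.Linear(3 * 16 * 16, vis_dim), nn.GELU())
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        # Trainable projection that maps visual features into the LLM token space.
        self.proj = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the small language model backbone (Phi-2 in the paper).
        self.token_emb = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, input_ids):
        vis_feats = self.vision_encoder(image_patches)   # (B, N_img, vis_dim)
        vis_tokens = self.proj(vis_feats)                 # (B, N_img, llm_dim)
        txt_tokens = self.token_emb(input_ids)            # (B, N_txt, llm_dim)
        # Prepend projected visual tokens to the text sequence, then decode.
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(seq))


if __name__ == "__main__":
    model = TinyMultimodalSketch()
    patches = torch.randn(1, 4, 3 * 16 * 16)   # 4 dummy flattened image patches
    ids = torch.randint(0, 1000, (1, 8))        # 8 dummy text token ids
    print(model(patches, ids).shape)            # torch.Size([1, 12, 1000])
```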
Quick Start & Requirements
Create and activate the conda environment: `conda env create -f environment.yml`, then `conda activate tinygptv`.
Run the Stage 4 demo: `python demo_v2.py --cfg-path eval_configs/tinygptv_stage4_eval.yaml --gpu-id 0`.
Run the Stage 1-3 demo: `python demo.py --cfg-path eval_configs/tinygptv_stage1_2_3_eval.yaml --gpu-id 0`.
Set `low_resource=True` in the eval config to reduce GPU memory usage on smaller GPUs.
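In MiniGPT-4-style codebases, on which TinyGPT-V builds, a `low_resource` flag typically loads the language model with 8-bit weights so it fits on smaller GPUs. The snippet below is a generic, hedged illustration of that technique with Hugging Face `transformers`; the model id and arguments are assumptions, and TinyGPT-V applies the setting through its own YAML configs rather than this exact call.

```python
# Generic illustration of 8-bit weight loading, the kind of technique a
# low_resource-style flag usually toggles. The model id and arguments are
# illustrative; TinyGPT-V wires this behavior through its own configs
# rather than this exact call.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",                                    # Phi-2 backbone
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},                                   # keep everything on GPU 0
)
```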
Highlighted Details
Multi-stage training is launched via `torchrun` commands.
Relies on the `transformers` library's Phi model implementation.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The Stage 4 model's grounding abilities do not yet perform optimally and remain under active development. Training requires executing multiple `torchrun` commands sequentially, one per stage, as sketched below.
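For reference, each stage is typically launched with a command of this shape (the script and config paths are assumptions based on the repository layout; check the repo's `train_configs/` directory for the actual file names):

`torchrun --nproc_per_node NUM_GPU train.py --cfg-path train_configs/<stage_config>.yaml`

with `NUM_GPU` and `<stage_config>` replaced per stage, run in order from the first stage to the last.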