TinyGPT-V by DLYuanGod

Multimodal LLM research paper using small backbones for efficiency

created 1 year ago
1,295 stars

Top 31.5% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

TinyGPT-V offers an efficient multimodal large language model (MLLM) solution by leveraging small backbone architectures. It targets researchers and developers seeking high-performance MLLMs with reduced computational requirements, achieving near state-of-the-art results with significantly smaller models.

How It Works

TinyGPT-V employs a novel approach that integrates a small, efficient language model (Phi-2) with a vision encoder. The training process is divided into stages, progressively enhancing multimodal capabilities. This strategy allows for efficient learning and adaptation, enabling the model to achieve strong performance on multimodal tasks while maintaining a low resource footprint.
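
As a rough illustration of that pattern (not the repository's actual modules), the sketch below projects vision-encoder features into the language model's embedding space and prepends them to the text embeddings. Layer names and dimensions are illustrative; 2560 is chosen to match Phi-2's hidden size.

    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        """Illustrative only: map vision-encoder patch features into the
        language model's embedding space so they can be prepended to the
        text token embeddings."""
        def __init__(self, vision_dim=1408, llm_dim=2560):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, vision_feats, text_embeds):
            # vision_feats: (batch, num_patches, vision_dim)
            # text_embeds:  (batch, seq_len, llm_dim)
            img_embeds = self.proj(vision_feats)
            # Image tokens come first, followed by the text tokens
            return torch.cat([img_embeds, text_embeds], dim=1)

    # Shape check with dummy tensors
    projector = VisionToLLMProjector()
    fused = projector(torch.randn(1, 257, 1408), torch.randn(1, 32, 2560))
    print(fused.shape)  # torch.Size([1, 289, 2560])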

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda env create -f environment.yml, conda activate tinygptv).
  • Prerequisites: Python 3.9+, PyTorch, transformers library, git-lfs. Requires downloading pretrained Phi-2 weights and model checkpoints for different stages.
  • Demo: Stage 4: python demo_v2.py --cfg-path eval_configs/tinygptv_stage4_eval.yaml --gpu-id 0. Stages 1-3: python demo.py --cfg-path eval_configs/tinygptv_stage1_2_3_eval.yaml --gpu-id 0.
  • Resource: Loading the LLM in 16-bit takes roughly 8GB of GPU memory; setting low_resource=True reduces this below 8GB (see the loading sketch after this list).
  • Links: Hugging Face Demo (Stage-4 v1), Hugging Face Demo (Stage-3), Pretrained Checkpoints.
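
The memory note above refers to the low_resource option in the eval configs. As an illustration of the underlying idea only (not the project's actual loading code), the snippet below loads the Phi-2 backbone with the transformers library in 16-bit and in 8-bit; the 8-bit path requires bitsandbytes.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 16-bit load: about 2 bytes per parameter for the 2.7B-parameter Phi-2,
    # i.e. roughly 5-6 GB of weights before activations and overhead.
    model_fp16 = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # 8-bit load: roughly halves the weight memory, which is the general idea
    # behind low_resource=True (the repo's own code path may differ).
    model_int8 = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )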

Highlighted Details

  • Achieves 98% of InstructBLIP's performance, exceeding other models of the same period.
  • Utilizes the Phi-2 (2.7B) language model as its backbone.
  • Training involves multiple stages with specific dataset preparations and torchrun commands.
  • Requires manual patching of the transformers library's Phi model implementation (see the sketch after this list).
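
The patching step amounts to overwriting the Phi model files inside the installed transformers package with the repository's modified versions. A minimal sketch, assuming a patched modeling_phi.py at the repository root (the actual file names and locations are listed in the TinyGPT-V README):

    import shutil
    from pathlib import Path

    import transformers

    # Locate the Phi implementation inside the installed transformers package
    phi_dir = Path(transformers.__file__).parent / "models" / "phi"

    # Hypothetical source path: replace with the files the README tells you to copy
    shutil.copy("TinyGPT-V/modeling_phi.py", phi_dir / "modeling_phi.py")
    print(f"patched {phi_dir}")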

Maintenance & Community

  • Project initiated by DLYuanGod, with contributions from multiple institutions.
  • Latest updates include paper revisions and checkpoint links.
  • BibTeX citation provided for academic use.

Licensing & Compatibility

  • Licensed under BSD 3-Clause License.
  • Codebases derived from Lavis also adhere to BSD 3-Clause License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The authors note that the Stage 4 model's grounding abilities do not yet perform optimally and remain under development. Training requires executing multiple torchrun commands in sequence, one for each stage.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

created 2 years ago
updated 11 months ago
4k stars
Top 0.1%