LaVIN by luogen1996

Vision-language instruction tuning research paper

created 2 years ago
522 stars

Top 61.2% on sourcepulse

View on GitHub
Project Summary

LaVIN provides an efficient vision-language instruction tuning framework, Mixture-of-Modality Adaptation (MMA), for large language models. It enables models to handle both single- and multi-modal instructions with improved reasoning and training efficiency, targeting researchers and developers working with multimodal AI.

How It Works

MMA connects image encoders to LLMs using lightweight adapters, minimizing trainable parameters. A novel routing algorithm dynamically shifts reasoning paths for different instruction types, enhancing adaptability. This approach significantly reduces training time and computational resources compared to full model fine-tuning.
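The adapter-plus-router idea can be illustrated with a minimal sketch. This is not LaVIN's actual code or API: the class names, dimensions, and the soft (weighted) routing below are illustrative assumptions, written framework-free for clarity. The key point it shows is that only the small adapter and gate matrices would be trainable, while the backbone stays frozen.

```python
# Hypothetical sketch of Mixture-of-Modality Adaptation: lightweight
# bottleneck adapters plus a router that weights adapter paths per input.
# All names and shapes are illustrative, not LaVIN's real implementation.
import math
import random

random.seed(0)

def linear(x, w, b):
    """y = W x + b with W as a plain list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.
    Only these small matrices are trained; the backbone is frozen."""
    def __init__(self, dim, bottleneck):
        self.w_down = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        self.w_up = [[random.gauss(0, 0.02) for _ in range(bottleneck)] for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x):
        h = [max(0.0, v) for v in linear(x, self.w_down, self.b_down)]
        return [xi + ui for xi, ui in zip(x, linear(h, self.w_up, self.b_up))]

class MixtureOfModalityAdapters:
    """A gate scores each adapter path from the input itself, so uni- and
    multi-modal instructions can take different reasoning paths."""
    def __init__(self, dim, bottleneck, n_paths=2):
        self.paths = [Adapter(dim, bottleneck) for _ in range(n_paths)]
        self.w_gate = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(n_paths)]
        self.b_gate = [0.0] * n_paths

    def __call__(self, x):
        weights = softmax(linear(x, self.w_gate, self.b_gate))
        outs = [p(x) for p in self.paths]
        return [sum(w * o[i] for w, o in zip(weights, outs)) for i in range(len(x))]

mma = MixtureOfModalityAdapters(dim=8, bottleneck=2)
y = mma([0.1] * 8)
print(len(y))  # 8 — output keeps the hidden dimension of the frozen backbone
```

Because each `Adapter` is a dim×bottleneck pair rather than a full dim×dim matrix, the trainable parameter count stays tiny relative to the backbone, which is what makes the reported 3.8M-parameter budget plausible.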

Quick Start & Requirements

  • Install: pip install -r requirements.txt and pip install -e . after setting up a conda environment with Python 3.8 and PyTorch 1.12.1.
  • Data: Requires LLaMA weights (official or unofficial HuggingFace), MSCOCO 2014 dataset, and ScienceQA dataset. Vicuna weights are also supported.
  • Hardware: Fine-tuning LaVIN-lite (7B) is possible on a single RTX 3090, using roughly 9GB of VRAM; larger models or faster training require multiple A100 GPUs.
  • Links: Project Page, Paper, Demo
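The install steps above can be collected into a single setup script. This is a sketch assuming repository defaults; the environment name `lavin` is illustrative, and exact PyTorch install flags may differ per CUDA version.

```shell
# Create a conda environment with the stated Python and PyTorch versions
conda create -n lavin python=3.8 -y
conda activate lavin
pip install torch==1.12.1

# Install LaVIN's dependencies, then the package itself in editable mode
pip install -r requirements.txt
pip install -e .
```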

Highlighted Details

  • Achieves 89.4% (7B) and 90.8% (13B) average accuracy on ScienceQA.
  • Reduces training time to 1.4 hours and trainable parameters to 3.8M for 7B models.
  • Offers 4-bit training options for reduced memory footprint.
  • Demonstrates competitive performance on MME benchmark with limited data and cost.

Maintenance & Community

The project is associated with Xiamen University. Key updates include NeurIPS 2023 acceptance and release of evaluation codes, 4-bit training, and pre-trained checkpoints.

Licensing & Compatibility

The repository does not explicitly state a license. It acknowledges borrowing code and data from LLaMA, Stanford Alpaca, LLaVA, MiniGPT-4, and LLaMA-Adapter, which may have their own licensing terms. Commercial use requires careful review of these dependencies.

Limitations & Caveats

The README mentions that performance can be affected by the number of GPUs used for fine-tuning, and the team is working to address this. Support for additional modalities like audio and video is listed as a future TODO.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding

Framework for accelerating LLM generation using multiple decoding heads

Top 0.2% · 3k stars
created 1 year ago · updated 1 year ago