Vision-language instruction tuning research paper
LaVIN provides an efficient vision-language instruction tuning framework, Mixture-of-Modality Adaptation (MMA), for large language models. It enables models to handle both single- and multi-modal instructions with improved reasoning and training efficiency, targeting researchers and developers working with multimodal AI.
How It Works
MMA connects the image encoder to the LLM through lightweight adapters, so only a small fraction of parameters is trainable. A novel routing algorithm dynamically shifts the reasoning path depending on whether an instruction is text-only or multimodal. This significantly reduces training time and compute compared with full model fine-tuning; a sketch of the idea follows.
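The sketch below illustrates the general pattern in PyTorch: a bottleneck adapter inserted alongside a frozen LLM block, with a small router that mixes a text-oriented path and a multimodal path per example. All names and dimensions here (MMAdapter, the bottleneck size, the mean-pooled routing signal) are illustrative assumptions, not LaVIN's actual implementation.

```python
# Illustrative sketch of Mixture-of-Modality Adaptation (MMA), not
# LaVIN's actual code: a lightweight bottleneck adapter with a learned
# router that mixes a text-oriented path and a multimodal path.
import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    """Bottleneck adapter whose router mixes two candidate paths."""
    def __init__(self, dim: int, bottleneck: int = 8, temperature: float = 10.0):
        super().__init__()
        # Two small down-projections: one path for text-only
        # instructions, one for multimodal instructions.
        self.down_text = nn.Linear(dim, bottleneck)
        self.down_multi = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Router: per-example mixing weights from the mean-pooled tokens
        # (the routing signal is an assumption made for this sketch).
        self.router = nn.Linear(dim, 2)
        self.temperature = temperature
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LLM block.
        weights = torch.softmax(
            self.router(x.mean(dim=1)) / self.temperature, dim=-1
        )  # (batch, 2)
        h = (weights[:, 0, None, None] * self.act(self.down_text(x))
             + weights[:, 1, None, None] * self.act(self.down_multi(x)))
        # Residual connection keeps the frozen backbone as the base model.
        return x + self.up(h)

# Only the adapter and router parameters would be trained.
adapter = MMAdapter(dim=4096)
print(adapter(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])
```

Because gradients flow only into the adapter and router weights, the trainable footprint stays tiny relative to the frozen backbone, which is where the training-time savings come from.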
Quick Start & Requirements
After setting up a conda environment with Python 3.8 and PyTorch 1.12.1, install the dependencies and the package itself:

pip install -r requirements.txt
pip install -e .
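As a companion to the efficiency claim above, the following hypothetical snippet shows the usual parameter-efficient setup: freeze the backbone and hand only the adapter parameters to the optimizer. The name-matching rule is an illustrative assumption, not how the LaVIN scripts select parameters.

```python
# Hypothetical parameter-efficient training setup: freeze the backbone
# and optimize only parameters whose names mark them as adapters. The
# name-matching rule is an illustrative assumption, not LaVIN's scripts.
import torch

def adapter_parameters(model: torch.nn.Module):
    for name, param in model.named_parameters():
        # Freeze everything that is not part of an adapter module.
        param.requires_grad = "adapter" in name
    return [p for p in model.parameters() if p.requires_grad]

# Usage (assuming `llm` is a model with MMAdapter modules inserted):
# optimizer = torch.optim.AdamW(adapter_parameters(llm), lr=1e-3)
```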
Maintenance & Community
The project is associated with Xiamen University. Key updates include acceptance at NeurIPS 2023 and the release of evaluation code, 4-bit training support, and pre-trained checkpoints.
Licensing & Compatibility
The repository does not explicitly state a license. It acknowledges borrowing code and data from LLaMA, Stanford Alpaca, LLaVA, MiniGPT-4, and LLaMA-Adapter, which may have their own licensing terms. Commercial use requires careful review of these dependencies.
Limitations & Caveats
The README mentions that performance can be affected by the number of GPUs used for fine-tuning, and the team is working to address this. Support for additional modalities like audio and video is listed as a future TODO.