Vision-language instruction tuning research paper
LaVIN provides an efficient vision-language instruction tuning framework, Mixture-of-Modality Adaptation (MMA), for large language models. It enables models to handle both single- and multi-modal instructions with improved reasoning and training efficiency, targeting researchers and developers working with multimodal AI.
How It Works
MMA connects the image encoder to the LLM through lightweight adapters, so only a small fraction of parameters is trainable. A novel routing algorithm dynamically shifts the reasoning path depending on whether an instruction is text-only or multimodal. This significantly reduces training time and compute compared with full model fine-tuning; a sketch of the idea follows.
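The sketch below illustrates the general pattern in PyTorch: a bottleneck adapter inserted alongside a frozen LLM block, with a small router that mixes a text-oriented path and a multimodal path per example. All names and dimensions here (MMAdapter, the bottleneck size, the mean-pooled routing signal) are illustrative assumptions, not LaVIN's actual implementation.

```python
# Illustrative sketch of Mixture-of-Modality Adaptation (MMA), not
# LaVIN's actual code: a lightweight bottleneck adapter with a learned
# router that mixes a text-oriented path and a multimodal path.
import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    """Bottleneck adapter whose router mixes two candidate paths."""
    def __init__(self, dim: int, bottleneck: int = 8, temperature: float = 10.0):
        super().__init__()
        # Two small down-projections: one path for text-only
        # instructions, one for multimodal instructions.
        self.down_text = nn.Linear(dim, bottleneck)
        self.down_multi = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Router: per-example mixing weights from the mean-pooled tokens
        # (the routing signal is an assumption made for this sketch).
        self.router = nn.Linear(dim, 2)
        self.temperature = temperature
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LLM block.
        weights = torch.softmax(
            self.router(x.mean(dim=1)) / self.temperature, dim=-1
        )  # (batch, 2)
        h = (weights[:, 0, None, None] * self.act(self.down_text(x))
             + weights[:, 1, None, None] * self.act(self.down_multi(x)))
        # Residual connection keeps the frozen backbone as the base model.
        return x + self.up(h)

# Only the adapter and router parameters would be trained.
adapter = MMAdapter(dim=4096)
print(adapter(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])
```

Because gradients flow only into the adapter and router weights, the trainable footprint stays tiny relative to the frozen backbone, which is where the training-time savings come from.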
Quick Start & Requirements
After setting up a conda environment with Python 3.8 and PyTorch 1.12.1, install the dependencies and the package itself:

pip install -r requirements.txt
pip install -e .
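As a companion to the efficiency claim above, the following hypothetical snippet shows the usual parameter-efficient setup: freeze the backbone and hand only the adapter parameters to the optimizer. The name-matching rule is an illustrative assumption, not how the LaVIN scripts select parameters.

```python
# Hypothetical parameter-efficient training setup: freeze the backbone
# and optimize only parameters whose names mark them as adapters. The
# name-matching rule is an illustrative assumption, not LaVIN's scripts.
import torch

def adapter_parameters(model: torch.nn.Module):
    for name, param in model.named_parameters():
        # Freeze everything that is not part of an adapter module.
        param.requires_grad = "adapter" in name
    return [p for p in model.parameters() if p.requires_grad]

# Usage (assuming `llm` is a model with MMAdapter modules inserted):
# optimizer = torch.optim.AdamW(adapter_parameters(llm), lr=1e-3)
```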
Maintenance & Community
The project is associated with Xiamen University. Key updates include acceptance at NeurIPS 2023 and the release of evaluation code, 4-bit training support, and pre-trained checkpoints.
Licensing & Compatibility
The repository does not explicitly state a license. It acknowledges borrowing code and data from LLaMA, Stanford Alpaca, LLaVA, MiniGPT-4, and LLaMA-Adapter, which may have their own licensing terms. Commercial use requires careful review of these dependencies.
Limitations & Caveats
The README mentions that performance can be affected by the number of GPUs used for fine-tuning, and the team is working to address this. Support for additional modalities like audio and video is listed as a future TODO.