Minimal codebase for finetuning large multimodal models
This repository provides a minimal, unified codebase for fine-tuning a wide array of large multimodal models (LMMs), including image-only, interleaved, and video models. It targets researchers and practitioners seeking a straightforward and flexible framework for experimenting with and adapting LMMs, leveraging Hugging Face's official implementations for seamless integration and inference.
How It Works
The framework abstracts core fine-tuning components such as model loading and data collation, enabling easy integration of new LMMs. It uses Hugging Face's `transformers` library, ensuring that fine-tuned models retain compatibility with standard Hugging Face inference pipelines. The codebase prioritizes simplicity and transparency, making it easier to understand, modify, and quickly iterate on fine-tuning strategies, including full fine-tuning, LoRA, and Q-LoRA for the LLM component, and full fine-tuning or LoRA for vision encoders.
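Because checkpoints stay in the standard Hugging Face format, a fine-tuned model can be loaded for inference with `transformers` alone. A minimal sketch, assuming a LLaVA-style checkpoint saved at a hypothetical local path (the model class and prompt template depend on the model you fine-tuned):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical output directory from a fine-tuning run; any LLaVA-style
# checkpoint saved in Hugging Face format is loaded the same way.
ckpt = "./checkpoints/llava-1.5-7b-finetuned"

processor = AutoProcessor.from_pretrained(ckpt)
model = LlavaForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# Prompt template follows the LLaVA-1.5 convention; other models differ.
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```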
Quick Start & Requirements
Create a Python 3.10 conda environment, activate it, and install the requirements; Flash Attention can optionally be installed for faster attention kernels.
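The full setup as a single shell session:

```bash
conda create -n lmms-finetune python=3.10 -y
conda activate lmms-finetune
python -m pip install -r requirements.txt
# Optional: Flash Attention
python -m pip install --no-cache-dir --no-build-isolation flash-attn
```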
Highlighted Details
- A web UI, launched with `python webui.py`.
- A script (`merge_lora_weights.py`) for merging LoRA weights into a standalone model; a conceptual sketch of the merge step follows this list.
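The merge step is conceptually equivalent to the following sketch, which uses `peft` directly rather than the bundled script; the model class, model ID, and paths are assumptions for illustration:

```python
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

# Hypothetical base model and adapter paths; merge_lora_weights.py
# automates this for the models supported by the repository.
base_id = "llava-hf/llava-1.5-7b-hf"
adapter_dir = "./checkpoints/llava-lora-adapter"

base = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and save a standalone model
# that no longer needs peft at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("./checkpoints/llava-merged")
```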
Maintenance & Community
The codebase is built on Hugging Face's `transformers`.
Licensing & Compatibility
Limitations & Caveats
- Training issues can occur when `per_device_batch_size` is 1 or when text-only instances dominate the dataset.
- Installing `transformers` from GitHub is required; the usual command is shown below.
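Installing `transformers` from its GitHub main branch is typically done with:

```bash
python -m pip install git+https://github.com/huggingface/transformers
```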