Minimal codebase for finetuning large multimodal models
This repository provides a minimal, unified codebase for fine-tuning a wide array of large multimodal models (LMMs), including image-only, interleaved, and video models. It targets researchers and practitioners seeking a straightforward and flexible framework for experimenting with and adapting LMMs, leveraging Hugging Face's official implementations for seamless integration and inference.
How It Works
The framework abstracts core fine-tuning components such as model loading and data collation, enabling easy integration of new LMMs. It uses Hugging Face's `transformers` library, ensuring that fine-tuned models retain compatibility with standard Hugging Face inference pipelines. The codebase prioritizes simplicity and transparency, making it easier to understand, modify, and quickly iterate on fine-tuning strategies, including full fine-tuning, LoRA, and Q-LoRA for the LLM component, and full fine-tuning or LoRA for vision encoders.
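Because checkpoints stay in the standard Hugging Face format, a fine-tuned model can be loaded for inference with `transformers` alone. A minimal sketch, assuming a LLaVA-style checkpoint saved at a hypothetical local path (the model class and prompt template depend on the model you fine-tuned):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical output directory from a fine-tuning run; any LLaVA-style
# checkpoint saved in Hugging Face format is loaded the same way.
ckpt = "./checkpoints/llava-1.5-7b-finetuned"

processor = AutoProcessor.from_pretrained(ckpt)
model = LlavaForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# Prompt template follows the LLaVA-1.5 convention; other models differ.
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```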
Quick Start & Requirements
Create a Python 3.10 conda environment, activate it, and install the requirements; Flash Attention can optionally be installed for faster attention kernels.
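The full setup as a single shell session:

```bash
conda create -n lmms-finetune python=3.10 -y
conda activate lmms-finetune
python -m pip install -r requirements.txt
# Optional: Flash Attention
python -m pip install --no-cache-dir --no-build-isolation flash-attn
```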
Highlighted Details
- A web UI, launched with `python webui.py`.
- A script (`merge_lora_weights.py`) for merging LoRA weights into a standalone model; a conceptual sketch of the merge step follows this list.
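The merge step is conceptually equivalent to the following sketch, which uses `peft` directly rather than the bundled script; the model class, model ID, and paths are assumptions for illustration:

```python
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

# Hypothetical base model and adapter paths; merge_lora_weights.py
# automates this for the models supported by the repository.
base_id = "llava-hf/llava-1.5-7b-hf"
adapter_dir = "./checkpoints/llava-lora-adapter"

base = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and save a standalone model
# that no longer needs peft at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("./checkpoints/llava-merged")
```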
Maintenance & Community
The codebase is built on Hugging Face's `transformers`.
Licensing & Compatibility
Limitations & Caveats
- Training issues can occur when `per_device_batch_size` is 1 or when text-only instances dominate the dataset.
- Installing `transformers` from GitHub is required; the usual command is shown below.
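Installing `transformers` from its GitHub main branch is typically done with:

```bash
python -m pip install git+https://github.com/huggingface/transformers
```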