lmms-finetune by zjysteven

Minimal codebase for finetuning large multimodal models

Created 1 year ago · 313 stars · Top 87.4% on sourcepulse

Project Summary

This repository provides a minimal, unified codebase for fine-tuning a wide range of large multimodal models (LMMs), covering single-image, interleaved multi-image, and video models. It targets researchers and practitioners who want a straightforward, flexible framework for experimenting with and adapting LMMs, and it builds on Hugging Face's official model implementations so that fine-tuned models integrate seamlessly with standard inference tooling.

How It Works

The framework abstracts the core fine-tuning components, such as model loading and data collation, so new LMMs can be integrated with little effort. It builds on Hugging Face's transformers library, which means fine-tuned models remain compatible with standard Hugging Face inference pipelines. The codebase prioritizes simplicity and transparency, making it easy to understand, modify, and iterate on quickly. Supported strategies include full fine-tuning, LoRA, and Q-LoRA for the LLM component, and full fine-tuning or LoRA for the vision encoder.
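
Since fine-tuned checkpoints stay in Hugging Face format, they can be loaded back with the ordinary transformers classes. The sketch below assumes a LLaVA-1.5 checkpoint saved to a hypothetical local path; the model class and prompt template differ per model, and inference.md contains the repository's own examples.

```python
# Minimal inference sketch, assuming a LLaVA-1.5 checkpoint fine-tuned with this
# codebase and saved to a hypothetical local directory. Not the repository's own
# example code; see inference.md for that.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

checkpoint = "./checkpoints/llava-1.5-7b-finetuned"  # hypothetical output path
processor = AutoProcessor.from_pretrained(checkpoint)
model = LlavaForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"  # LLaVA-1.5 chat format
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```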

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n lmms-finetune python=3.10 -y; conda activate lmms-finetune), and install requirements (python -m pip install -r requirements.txt). Optionally install Flash Attention (python -m pip install --no-cache-dir --no-build-isolation flash-attn).
  • Prerequisites: Python 3.10+, PyTorch (install the build matching your CUDA setup), and optionally Flash Attention.
  • Resources: Requires significant GPU resources for fine-tuning LMMs. A Colab notebook is provided as a starting point.
  • Docs: supported_models.md, dataset.md, inference.md. An illustrative data-format sketch follows this list.
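
As a rough illustration of the data format only: training data is typically a JSON list of conversation records in the LLaVA style. The key names below (image, conversations, from, value) are assumptions for illustration; dataset.md is the authoritative reference.

```python
# Illustrative sketch of a LLaVA-style conversation record. The exact schema is
# defined in dataset.md; these key names are assumptions, not the official format.
import json

records = [
    {
        "image": "images/cat.jpg",  # image path (a list of paths for multi-image models)
        "conversations": [
            {"from": "human", "value": "<image>\nWhat animal is shown here?"},
            {"from": "gpt", "value": "A cat sitting on a windowsill."},
        ],
    }
]

with open("my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)
```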

Highlighted Details

  • Supports a broad range of LMMs, including LLaVA (1.5, NeXT/1.6, NeXT-Video, Onevision), Phi-3-Vision, Qwen-VL, Qwen2-VL, and Llama-3.2-Vision.
  • Offers fine-tuning for vision encoders and projectors.
  • Includes a Gradio web UI for interactive fine-tuning (python webui.py).
  • Provides a script (merge_lora_weights.py) for merging LoRA weights into a standalone model; a conceptual sketch follows this list.
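
Conceptually, merging amounts to folding the LoRA deltas back into the base weights and saving a plain Hugging Face checkpoint. The sketch below uses the peft library's merge_and_unload to show the idea; it is not merge_lora_weights.py itself, and the paths are hypothetical.

```python
# Conceptual sketch of a LoRA merge using peft; not the repository's
# merge_lora_weights.py script, and all paths here are hypothetical.
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "./checkpoints/lora-adapter")
merged = model.merge_and_unload()                     # fold LoRA deltas into base weights
merged.save_pretrained("./checkpoints/merged-model")  # standalone HF checkpoint
```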

Maintenance & Community

  • Active maintainer: Yuqian Hong.
  • Initial maintainers: Jingyang Zhang, Yueqian Lin.
  • Inspired by and builds upon LLaVA, Qwen, and Hugging Face transformers.
  • Citation details provided.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • DeepSpeed may encounter issues with text-only samples when per_device_batch_size is 1 or when text-only instances dominate the dataset.
  • LLaVA-Onevision has a caveat noted in the repository's documentation.
  • Qwen2.5-family support requires installing the latest transformers from GitHub.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 26 stars in the last 90 days
