lmms-finetune by zjysteven

Minimal codebase for finetuning large multimodal models

Created 1 year ago · 361 stars · Top 77.8% on SourcePulse

Project Summary

This repository provides a minimal, unified codebase for fine-tuning a wide array of large multimodal models (LMMs), including image-only, interleaved, and video models. It targets researchers and practitioners seeking a straightforward and flexible framework for experimenting with and adapting LMMs, leveraging Hugging Face's official implementations for seamless integration and inference.

How It Works

The framework abstracts core fine-tuning components such as model loading and data collation, making it easy to integrate new LMMs. It builds on Hugging Face's transformers library, so fine-tuned models remain compatible with standard Hugging Face inference pipelines. The codebase prioritizes simplicity and transparency, making it easy to understand, modify, and iterate on quickly. Supported strategies include full fine-tuning, LoRA, and Q-LoRA for the LLM component, and full fine-tuning or LoRA for the vision encoder.
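
As a rough illustration of the LoRA setup described above, the sketch below uses Hugging Face transformers together with the peft library. The model checkpoint and target_modules are illustrative assumptions, not values taken from this repository.

```python
# Sketch: LoRA fine-tuning setup in the style described above.
# Model ID and target_modules are illustrative assumptions, not
# values from this repository's configs.
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # any supported HF LMM checkpoint
    torch_dtype="auto",
)
# For Q-LoRA, the base model would instead be loaded with a
# quantization_config (e.g., transformers.BitsAndBytesConfig).

# Apply LoRA to the language-model component; the vision encoder
# can likewise be frozen, LoRA-tuned, or fully fine-tuned.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```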

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n lmms-finetune python=3.10 -y; conda activate lmms-finetune), and install requirements (python -m pip install -r requirements.txt). Optionally install Flash Attention (python -m pip install --no-cache-dir --no-build-isolation flash-attn).
  • Prerequisites: Python 3.10+, PyTorch (version determined by environment), optional Flash Attention.
  • Resources: Requires significant GPU resources for fine-tuning LMMs. A Colab notebook is provided as a starting point.
  • Docs: supported_models.md, dataset.md, inference.md.
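
Because fine-tuned checkpoints remain compatible with standard Hugging Face pipelines (see How It Works), inference can look like the following sketch. The checkpoint path, image file, and prompt format are placeholders, not values from the repository's inference.md.

```python
# Sketch: running a fine-tuned checkpoint through the standard
# Hugging Face API. Checkpoint path and prompt are placeholders.
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("path/to/finetuned-checkpoint")
model = LlavaForConditionalGeneration.from_pretrained("path/to/finetuned-checkpoint")

image = Image.open("example.jpg")
inputs = processor(
    text="USER: <image>\nDescribe this image. ASSISTANT:",
    images=image,
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```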

Highlighted Details

  • Supports a broad range of LMMs including LLaVA (1.5, 1.6, NeXT, Video, Onevision), Phi-3-Vision, Qwen-VL, Qwen2-VL, and Llama-3.2-Vision.
  • Offers fine-tuning for vision encoders and projectors.
  • Includes a Gradio web UI for interactive fine-tuning (python webui.py).
  • Provides a script (merge_lora_weights.py) for merging LoRA weights into a standalone model; a typical pattern is sketched below.
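
The repository's merge_lora_weights.py is not reproduced here, but merging LoRA adapters into a standalone model typically follows the peft pattern below. All paths are placeholders.

```python
# Sketch: the typical peft pattern for merging LoRA adapters into
# base weights. This is not the repository's merge_lora_weights.py;
# all paths are placeholders.
from transformers import LlavaForConditionalGeneration
from peft import PeftModel

base = LlavaForConditionalGeneration.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
model = model.merge_and_unload()  # folds adapter weights into the base model
model.save_pretrained("path/to/standalone-model")
# Remember to also save the processor/tokenizer alongside the merged model.
```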

Maintenance & Community

  • Active maintainer: Yuqian Hong.
  • Initial maintainers: Jingyang Zhang, Yueqian Lin.
  • Inspired by and builds upon LLaVA, Qwen, and Hugging Face transformers.
  • Citation details provided.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • DeepSpeed may encounter issues with text-only samples if per_device_batch_size is 1 or if text-only instances dominate the dataset.
  • LLaVA-Onevision carries a caveat noted in the repository's documentation.
  • Qwen2.5-family support requires installing the latest transformers from GitHub.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
