Qwen2-VL-Finetune by 2U1

Finetuning script for Qwen2-VL and Qwen2.5-VL models

Created 11 months ago · 1,097 stars · Top 34.7% on SourcePulse

Project Summary

This repository provides an open-source implementation for fine-tuning Alibaba Cloud's Qwen2-VL and Qwen2.5-VL multimodal large language models. It targets researchers and developers working with vision-language models, offering a streamlined process for adapting these powerful models to custom datasets and tasks, including support for multi-image and video inputs.

How It Works

The fine-tuning process is built on Hugging Face Transformers and uses the Liger-Kernel for memory-efficient training. It supports supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) objectives, applied either as full fine-tuning or through parameter-efficient methods such as LoRA and DoRA. The implementation handles diverse data formats, including LLaVA-specification JSON files, and lets you configure training parameters, quantization, and separate learning rates for each model component (vision tower, projector, language model).
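The per-component learning rates can be sketched as optimizer parameter groups. This is a hedged illustration, not the repository's actual code; the name prefixes ("visual.", "merger") follow Qwen2-VL's published module naming but should be treated as illustrative assumptions.

```python
# Hypothetical sketch: group parameters by component so the vision tower,
# projector, and language model each receive their own learning rate.
def build_param_groups(named_params, vision_lr=2e-6, projector_lr=1e-5, base_lr=1e-5):
    groups = {"vision": [], "projector": [], "language": []}
    for name, param in named_params:
        if "merger" in name:                 # multimodal projector
            groups["projector"].append(param)
        elif name.startswith("visual."):     # vision tower
            groups["vision"].append(param)
        else:                                # language model
            groups["language"].append(param)
    return [
        {"params": groups["vision"], "lr": vision_lr},
        {"params": groups["projector"], "lr": projector_lr},
        {"params": groups["language"], "lr": base_lr},
    ]

# Dummy names stand in for real tensors; in practice the list would come from
# model.named_parameters() and the groups would feed torch.optim.AdamW.
dummy = [
    ("visual.blocks.0.attn.qkv.weight", "w_vision"),
    ("visual.merger.mlp.0.weight", "w_projector"),
    ("model.layers.0.self_attn.q_proj.weight", "w_lm"),
]
param_groups = build_param_groups(dummy)
```

Checking "merger" before the "visual." prefix matters: in Qwen2-VL the projector lives under the vision module, so a naive prefix match would fold it into the vision-tower group.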

Quick Start & Requirements

  • Installation:
    • Docker: docker pull john119/vlm followed by docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash
    • From requirements.txt: pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu126 and pip install qwen-vl-utils flash-attn --no-build-isolation
    • From environment.yaml: conda env create -f environment.yaml, conda activate train, pip install qwen-vl-utils flash-attn --no-build-isolation
  • Prerequisites: Ubuntu 22.04, Nvidia Driver 550.120, CUDA 12.4. PyTorch installation requires the cu126 wheel.
  • Dataset: Requires data formatted according to LLaVA specification (JSON file with image/video paths).
  • Links: LLaVA-NeXT, Mipha, Qwen2-VL-7B-Instruct, Liger-Kernel
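A minimal example of the expected LLaVA-style JSON data. The field names ("id", "image", "conversations", "from", "value") follow the common LLaVA convention; the exact schema this repository accepts may differ, so treat this as an illustrative assumption:

```python
import json

# Illustrative LLaVA-spec sample: a list of records, each referencing an
# image by path and using the <image> placeholder inside the conversation.
sample = [
    {
        "id": "000001",
        "image": "images/000001.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A cat sitting on a windowsill."},
        ],
    }
]

# Round-trip through JSON to confirm the structure serializes cleanly.
serialized = json.dumps(sample, indent=2)
parsed = json.loads(serialized)
```

Video samples follow the same shape, with a video path in place of the image path.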

Highlighted Details

  • Supports Qwen2.5-VL series models.
  • Enables Direct Preference Optimization (DPO) training.
  • Optimized for multi-image and video training.
  • Offers memory-efficient training options with Liger-Kernel, LoRA, DoRA, and 8-bit training.
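The DPO objective mentioned above can be sketched numerically. This is the standard DPO loss in simplified scalar form, not code from the repository; the log-probabilities below are made-up values:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# If the policy prefers the chosen response more strongly than the reference
# model does, the loss falls below log(2) (~0.693), the value at indifference.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

In the actual trainer these log-probabilities would be summed token log-probs of the chosen and rejected responses under the policy and a frozen reference model.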

Maintenance & Community

The project is actively updated, with recent additions including DPO support and Qwen2.5-VL compatibility. It is based on LLaVA-NeXT, Mipha, and Qwen2-VL-7B-Instruct projects.

Licensing & Compatibility

Licensed under the Apache-2.0 License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Liger-Kernel is not compatible with QLoRA. A known issue with libcudnn_cnn_train.so.8 may require running unset LD_LIBRARY_PATH before training. GRPO support is listed as a future TODO.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 101 stars in the last 30 days
