Qwen2-VL-Finetune by 2U1

Finetuning script for Qwen2-VL and Qwen2.5-VL models

Created 11 months ago · 1,097 stars · Top 34.7% on SourcePulse

Project Summary

This repository provides an open-source implementation for fine-tuning Alibaba Cloud's Qwen2-VL and Qwen2.5-VL multimodal large language models. It targets researchers and developers working with vision-language models, offering a streamlined process for adapting these powerful models to custom datasets and tasks, including support for multi-image and video inputs.

How It Works

The fine-tuning process is built on Hugging Face Transformers and uses the Liger-Kernel for memory-efficient training. It supports supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) objectives, applied either as full fine-tuning or through parameter-efficient methods such as LoRA and DoRA. The implementation handles diverse data formats, including LLaVA-specification JSON files, and lets you configure training parameters, quantization, and separate learning rates for each model component (vision tower, projector, language model).
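The per-component learning rates can be sketched as optimizer parameter groups. This is a hedged illustration, not the repository's actual code; the name prefixes ("visual.", "merger") follow Qwen2-VL's published module naming but should be treated as illustrative assumptions.

```python
# Hypothetical sketch: group parameters by component so the vision tower,
# projector, and language model each receive their own learning rate.
def build_param_groups(named_params, vision_lr=2e-6, projector_lr=1e-5, base_lr=1e-5):
    groups = {"vision": [], "projector": [], "language": []}
    for name, param in named_params:
        if "merger" in name:                 # multimodal projector
            groups["projector"].append(param)
        elif name.startswith("visual."):     # vision tower
            groups["vision"].append(param)
        else:                                # language model
            groups["language"].append(param)
    return [
        {"params": groups["vision"], "lr": vision_lr},
        {"params": groups["projector"], "lr": projector_lr},
        {"params": groups["language"], "lr": base_lr},
    ]

# Dummy names stand in for real tensors; in practice the list would come from
# model.named_parameters() and the groups would feed torch.optim.AdamW.
dummy = [
    ("visual.blocks.0.attn.qkv.weight", "w_vision"),
    ("visual.merger.mlp.0.weight", "w_projector"),
    ("model.layers.0.self_attn.q_proj.weight", "w_lm"),
]
param_groups = build_param_groups(dummy)
```

Checking "merger" before the "visual." prefix matters: in Qwen2-VL the projector lives under the vision module, so a naive prefix match would fold it into the vision-tower group.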

Quick Start & Requirements

  • Installation:
    • Docker: docker pull john119/vlm followed by docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash
    • From requirements.txt: pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu126 and pip install qwen-vl-utils flash-attn --no-build-isolation
    • From environment.yaml: conda env create -f environment.yaml, conda activate train, pip install qwen-vl-utils flash-attn --no-build-isolation
  • Prerequisites: Ubuntu 22.04, Nvidia Driver 550.120, CUDA 12.4. PyTorch installation requires the cu126 wheel.
  • Dataset: Requires data formatted according to LLaVA specification (JSON file with image/video paths).
  • Links: LLaVA-NeXT, Mipha, Qwen2-VL-7B-Instruct, Liger-Kernel
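A minimal example of the expected LLaVA-style JSON data. The field names ("id", "image", "conversations", "from", "value") follow the common LLaVA convention; the exact schema this repository accepts may differ, so treat this as an illustrative assumption:

```python
import json

# Illustrative LLaVA-spec sample: a list of records, each referencing an
# image by path and using the <image> placeholder inside the conversation.
sample = [
    {
        "id": "000001",
        "image": "images/000001.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A cat sitting on a windowsill."},
        ],
    }
]

# Round-trip through JSON to confirm the structure serializes cleanly.
serialized = json.dumps(sample, indent=2)
parsed = json.loads(serialized)
```

Video samples follow the same shape, with a video path in place of the image path.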

Highlighted Details

  • Supports Qwen2.5-VL series models.
  • Enables Direct Preference Optimization (DPO) training.
  • Optimized for multi-image and video training.
  • Offers memory-efficient training options with Liger-Kernel, LoRA, DoRA, and 8-bit training.
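The DPO objective mentioned above can be sketched numerically. This is the standard DPO loss in simplified scalar form, not code from the repository; the log-probabilities below are made-up values:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# If the policy prefers the chosen response more strongly than the reference
# model does, the loss falls below log(2) (~0.693), the value at indifference.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

In the actual trainer these log-probabilities would be summed token log-probs of the chosen and rejected responses under the policy and a frozen reference model.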

Maintenance & Community

The project is actively updated, with recent additions including DPO support and Qwen2.5-VL compatibility. It is based on LLaVA-NeXT, Mipha, and Qwen2-VL-7B-Instruct projects.

Licensing & Compatibility

Licensed under the Apache-2.0 License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Liger-Kernel is not compatible with QLoRA. A known issue with libcudnn_cnn_train.so.8 may require running unset LD_LIBRARY_PATH before training. GRPO support is listed as a future TODO.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 101 stars in the last 30 days
