Fine-tuning script for Qwen2-VL models
This repository provides a streamlined Python-based solution for fine-tuning the Qwen2-VL multimodal large language model, targeting researchers and developers who wish to adapt the model to custom datasets. It offers a simpler alternative to heavier frameworks like LLaMA-Factory, enabling quick experimentation and deployment of fine-tuned Qwen2-VL models.
How It Works
The project leverages Hugging Face's transformers library and accelerate for distributed training. It supports mixed-precision training (bfloat16 + float32) for improved validation loss and uses flash_attention_2 for better training efficiency. The code is designed for clarity, allowing users to easily integrate their own data and training loops.
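The snippet below is a minimal sketch of what such a setup typically looks like with transformers and accelerate; the checkpoint name, learning rate, and data handling are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch (not the repository's exact code) of a Qwen2-VL fine-tuning
# setup with transformers + accelerate. Checkpoint name, learning rate, and
# data handling are illustrative assumptions.
import torch
from accelerate import Accelerator
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

accelerator = Accelerator(mixed_precision="bf16")   # bf16 autocast, fp32 master weights

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",                    # illustrative checkpoint
    attn_implementation="flash_attention_2",        # needs flash-attn and a supported GPU
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)

# In the training loop (DataLoader/collator are up to you):
#   outputs = model(**batch)           # batch includes input_ids, pixel_values, labels
#   accelerator.backward(outputs.loss)
#   optimizer.step(); optimizer.zero_grad()
```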
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt; the av package is used for video data processing. Run ./finetune.sh for single-GPU or ./finetune_distributed.sh for multi-GPU fine-tuning. Testing can be done with test_on_official_model.py and test_on_trained_model_by_us.py.
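For orientation, a single-image sanity check in the spirit of test_on_official_model.py might look roughly like the following; the checkpoint name, image path, and prompt are assumptions for illustration.

```python
# Rough sketch of a single-image sanity check; checkpoint, image path, and
# prompt are assumptions, not the repository's test script.
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"              # illustrative checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                   # any local test image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```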
Highlighted Details
Uses torchvision.io.VideoReader for faster video data loading.
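A hypothetical helper (not taken from the repository) showing the general pattern of decoding frames with torchvision.io.VideoReader and subsampling them for the model:

```python
# Hypothetical helper (not from the repository) showing the general pattern
# of decoding frames with torchvision.io.VideoReader and subsampling them.
import torch
from torchvision.io import VideoReader

def sample_frames(path: str, num_frames: int = 8) -> torch.Tensor:
    """Return `num_frames` uniformly spaced decoded frames stacked into one tensor."""
    reader = VideoReader(path, "video")              # needs a video backend (e.g. PyAV)
    frames = [frame["data"] for frame in reader]     # each item is a uint8 frame tensor
    if not frames:
        raise ValueError(f"no frames decoded from {path}")
    idx = torch.linspace(0, len(frames) - 1, num_frames).long()
    return torch.stack([frames[i] for i in idx])

# frames = sample_frames("clip.mp4")                 # hypothetical file name
```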
Maintenance & Community
The project is maintained by zhangfaen. Updates indicate ongoing development, including support for newer Qwen2.5 models and performance optimizations.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. It builds on Hugging Face-hosted Qwen2-VL models, which typically carry permissive licenses suitable for commercial use, but this should be verified before deployment.
Limitations & Caveats
The provided toy data is minimal, and the training code does not include an evaluation step by default, requiring manual implementation for comprehensive assessment. The README notes that video data can be memory-intensive, potentially necessitating batch size reductions or configuration adjustments.
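If you need an evaluation step, a minimal sketch of one you could bolt onto the existing loop is shown below; model, val_loader, and accelerator are assumed to come from your own training setup.

```python
# Hedged sketch of an evaluation step you could add yourself; `model`,
# `val_loader`, and `accelerator` are assumed to come from your own setup.
import torch

@torch.no_grad()
def evaluate(model, val_loader, accelerator):
    """Average validation loss over pre-collated batches (labels included)."""
    model.eval()
    total, steps = 0.0, 0
    for batch in val_loader:
        loss = model(**batch).loss.detach()
        loss = accelerator.gather(loss.unsqueeze(0)).mean()   # average across processes
        total += loss.item()
        steps += 1
    model.train()
    return total / max(steps, 1)
```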