Fine-tuning script for Qwen2-VL models
This repository provides a streamlined Python-based solution for fine-tuning the Qwen2-VL multimodal large language model, targeting researchers and developers who wish to adapt the model to custom datasets. It offers a simpler alternative to heavier frameworks like LLaMA-Factory, enabling quick experimentation and deployment of fine-tuned Qwen2-VL models.
How It Works
The project leverages Hugging Face's transformers library and accelerate for distributed training. It supports mixed-precision training (bfloat16 + float32) for improved validation loss and uses flash_attention_2 for better training efficiency. The code is designed for clarity, allowing users to easily integrate their own data and training loops.
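The snippet below is a minimal sketch of what such a setup typically looks like with transformers and accelerate; the checkpoint name, learning rate, and data handling are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch (not the repository's exact code) of a Qwen2-VL fine-tuning
# setup with transformers + accelerate. Checkpoint name, learning rate, and
# data handling are illustrative assumptions.
import torch
from accelerate import Accelerator
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

accelerator = Accelerator(mixed_precision="bf16")   # bf16 autocast, fp32 master weights

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",                    # illustrative checkpoint
    attn_implementation="flash_attention_2",        # needs flash-attn and a supported GPU
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)

# In the training loop (DataLoader/collator are up to you):
#   outputs = model(**batch)           # batch includes input_ids, pixel_values, labels
#   accelerator.backward(outputs.loss)
#   optimizer.step(); optimizer.zero_grad()
```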
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt; the av package is used for video data processing. Run ./finetune.sh for single-GPU or ./finetune_distributed.sh for multi-GPU fine-tuning. Testing can be done with test_on_official_model.py and test_on_trained_model_by_us.py.
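For orientation, a single-image sanity check in the spirit of test_on_official_model.py might look roughly like the following; the checkpoint name, image path, and prompt are assumptions for illustration.

```python
# Rough sketch of a single-image sanity check; checkpoint, image path, and
# prompt are assumptions, not the repository's test script.
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"              # illustrative checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                   # any local test image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```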
Highlighted Details
Uses torchvision.io.VideoReader for faster video data loading.
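A hypothetical helper (not taken from the repository) showing the general pattern of decoding frames with torchvision.io.VideoReader and subsampling them for the model:

```python
# Hypothetical helper (not from the repository) showing the general pattern
# of decoding frames with torchvision.io.VideoReader and subsampling them.
import torch
from torchvision.io import VideoReader

def sample_frames(path: str, num_frames: int = 8) -> torch.Tensor:
    """Return `num_frames` uniformly spaced decoded frames stacked into one tensor."""
    reader = VideoReader(path, "video")              # needs a video backend (e.g. PyAV)
    frames = [frame["data"] for frame in reader]     # each item is a uint8 frame tensor
    if not frames:
        raise ValueError(f"no frames decoded from {path}")
    idx = torch.linspace(0, len(frames) - 1, num_frames).long()
    return torch.stack([frames[i] for i in idx])

# frames = sample_frames("clip.mp4")                 # hypothetical file name
```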
Maintenance & Community
The project is maintained by zhangfaen. Updates indicate ongoing development, including support for newer Qwen2.5 models and performance optimizations.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. It builds on Hugging Face-hosted Qwen2-VL models, which typically carry permissive licenses suitable for commercial use, but this should be verified before deployment.
Limitations & Caveats
The provided toy data is minimal, and the training code does not include an evaluation step by default, requiring manual implementation for comprehensive assessment. The README notes that video data can be memory-intensive, potentially necessitating batch size reductions or configuration adjustments.
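If you need an evaluation step, a minimal sketch of one you could bolt onto the existing loop is shown below; model, val_loader, and accelerator are assumed to come from your own training setup.

```python
# Hedged sketch of an evaluation step you could add yourself; `model`,
# `val_loader`, and `accelerator` are assumed to come from your own setup.
import torch

@torch.no_grad()
def evaluate(model, val_loader, accelerator):
    """Average validation loss over pre-collated batches (labels included)."""
    model.eval()
    total, steps = 0.0, 0
    for batch in val_loader:
        loss = model(**batch).loss.detach()
        loss = accelerator.gather(loss.unsqueeze(0)).mean()   # average across processes
        total += loss.item()
        steps += 1
    model.train()
    return total / max(steps, 1)
```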