MLLM for training LLaVA-like models on limited hardware
This project provides a framework for training and deploying multimodal large language models (MLLMs) based on the QwenLM architecture, specifically targeting users with limited hardware resources (e.g., RTX 3090/4090 24GB GPUs). It enables multimodal capabilities including single-image, multi-image, and video-based question answering and multi-turn conversations, with the goal of making advanced MLLM training accessible for personal projects.
How It Works
The core innovation lies in its implementation of Pipeline Parallelism (PP) combined with Data Parallelism (DP) using DeepSpeed. This approach distributes the model's layers across multiple GPUs, allowing models to be trained that would not otherwise fit in a single GPU's memory. The framework supports a two-stage training process, mirroring LLaVA's pretrain and supervised fine-tuning (SFT) stages, and offers custom data formats for continued training.
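As a rough illustration of how such a pipeline split can be expressed with DeepSpeed, the sketch below partitions a toy decoder-only model across two stages. The layer classes, stage count, and config path are assumptions for illustration only, not the project's actual model or partitioning.

```python
# Minimal sketch: pipeline parallelism with DeepSpeed's PipelineModule.
# Layer classes, stage count, and "ds_config.json" are illustrative assumptions.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec


class EmbeddingLayer(nn.Module):  # hypothetical stand-in for the LLM embedding
    def __init__(self, vocab_size=32000, hidden=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids):
        return self.embed(input_ids)


class DecoderBlock(nn.Module):  # hypothetical stand-in for one transformer block
    def __init__(self, hidden=4096):
        super().__init__()
        self.ff = nn.Linear(hidden, hidden)

    def forward(self, hidden_states):
        return self.ff(hidden_states)


# Distributed setup is required before building the pipeline
# (normally handled by launching with the `deepspeed` CLI).
deepspeed.init_distributed()

# Express the model as an ordered list of layers so DeepSpeed can split it
# across GPUs; num_stages=2 would map to e.g. two 24 GB cards.
layers = [LayerSpec(EmbeddingLayer)] + [LayerSpec(DecoderBlock) for _ in range(32)]
model = PipelineModule(layers=layers, num_stages=2)

# deepspeed.initialize wires up the combined pipeline + data-parallel engine
# from a JSON/dict config (micro-batch size, ZeRO stage, optimizer, etc.).
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config="ds_config.json",  # hypothetical config path
)
```

With this layout, DeepSpeed handles micro-batch scheduling between the stages, so data parallelism can be layered on top simply by launching more ranks per stage.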
Quick Start & Requirements
Install the package with `pip install -e .` within a Python 3.8 conda environment. See WEIGHT.md and DATA.md for preparing model weights and training data. Inference supports automatic device placement (`device_map="auto"` for the LLM) and CPU inference.
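As a rough sketch of the two inference placements mentioned above, the LLM component can be loaded via Hugging Face Transformers either with `device_map="auto"` or on CPU only. The checkpoint path below is a placeholder assumption; substitute the weights prepared per WEIGHT.md.

```python
# Sketch of the inference options above. The checkpoint path is a placeholder.
import torch
from transformers import AutoModelForCausalLM

ckpt = "path/to/your/mllm-checkpoint"  # hypothetical path

# Option 1: let accelerate spread the model across available GPUs.
model_gpu = AutoModelForCausalLM.from_pretrained(
    ckpt,
    device_map="auto",          # automatic layer placement across devices
    torch_dtype=torch.float16,  # half precision to fit 24 GB cards
)

# Option 2: CPU-only inference (slower, but no GPU required);
# omitting device_map loads the model on CPU by default.
model_cpu = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.float32,
)
```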