Open-source code for training LLaVA-NeXT models
This repository provides an open-source implementation for training LLaVA-NeXT, a large multi-modal model. It aims to reproduce LLaVA-NeXT results and offers all training data and checkpoints for research. The project is suitable for researchers and practitioners in the large multi-modal model community.
How It Works
The training process involves two stages: feature alignment and visual instruction tuning. The feature alignment stage trains a projector that connects a frozen vision encoder (CLIP-L-336) to a frozen LLM, using a subset of the LAION-CC-SBU dataset. The subsequent visual instruction tuning stage fine-tunes the full model on roughly 1 million open-source instruction samples. The implementation is based on the LLaVA codebase with minimal modifications, so it stays easy to follow.
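As a rough sketch, the two stages map onto two separate training launches; the script paths below are illustrative placeholders, not the repository's exact file names:

```bash
# Stage 1: feature alignment. Only the projector between the frozen
# CLIP-L-336 encoder and the frozen LLM is trained, on the LAION-CC-SBU subset.
bash scripts/pretrain.sh    # hypothetical path

# Stage 2: visual instruction tuning. The whole model is fine-tuned
# on the ~1M open-source instruction samples.
bash scripts/finetune.sh    # hypothetical path
```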
Quick Start & Requirements
Create a conda environment (conda create -n llava-next python=3.10), activate it, and install the package (pip install -e .). For training, install the additional packages (pip install -e ".[train]"). Data preparation is described in Data.md.
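Put together, a typical setup looks like the following; the conda activate step is implied by the instructions above:

```bash
conda create -n llava-next python=3.10
conda activate llava-next
pip install -e .             # base package
pip install -e ".[train]"    # extra dependencies needed for training
```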
Highlighted Details
Training supports unfreezing the vision tower (--unfreeze_mm_vision_tower True) and processing images at variable resolutions (--image_aspect_ratio anyres).
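As an illustration of where these flags fit, a training launch might pass them alongside the usual model and data arguments; the launcher and entry point below are assumptions, only the two flags come from the repository:

```bash
# Illustrative only: the entry point and the omitted arguments are assumptions.
deepspeed llava/train/train_mem.py \
    --unfreeze_mm_vision_tower True \
    --image_aspect_ratio anyres
    # ...plus the usual model, data, and DeepSpeed arguments
```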
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not explicitly state a license, which could complicate commercial use. While the project aims for reproducibility, the hardware requirements for full training (16x A100 80GB GPUs) may put it out of reach for many users.