Open-LLaVA-NeXT by xiaoachen98

Open-source code for training LLaVA-NeXT models

created 1 year ago
412 stars

Top 72.0% on sourcepulse

Project Summary

This repository provides an open-source implementation for training LLaVA-NeXT, a large multi-modal model. It aims to reproduce LLaVA-NeXT results and offers all training data and checkpoints for research. The project is suitable for researchers and practitioners in the large multi-modal model community.

How It Works

The training process has two stages: feature alignment and visual instruction tuning. The feature alignment stage connects a frozen vision encoder (CLIP-L-336) to a frozen LLM by training a projector on a subset of the LAION-CC-SBU dataset. The subsequent visual instruction tuning stage fine-tunes the full model on 1 million open-source samples. The implementation is based on the LLaVA codebase with minimal modifications for ease of understanding.
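
To make the two-stage setup concrete, here is a minimal PyTorch sketch of the wiring. It is not the repository's actual code: the checkpoint names, the two-layer MLP projector, and the freezing pattern follow common LLaVA conventions and should be read as assumptions rather than Open-LLaVA-NeXT's exact configuration.

```python
# Minimal sketch (assumed LLaVA-style wiring, not the repo's code) of how a
# frozen CLIP vision tower feeds a frozen LLM through a trainable projector.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # assumed base LLM

# Projector maps vision-encoder hidden states into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
)

# Stage 1 (feature alignment): vision tower and LLM stay frozen, only the
# projector receives gradients.
for p in vision_tower.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Turn an image batch into token embeddings the LLM can consume."""
    patch_features = vision_tower(pixel_values).last_hidden_state  # (B, N, D_vis)
    return projector(patch_features)                               # (B, N, D_llm)

# Stage 2 (visual instruction tuning): unfreeze the LLM (and optionally the
# vision tower, cf. --unfreeze_mm_vision_tower) and train end to end.
for p in llm.parameters():
    p.requires_grad = True
```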

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llava-next python=3.10), activate it, and install the package (pip install -e .). For training, install the additional dependencies with pip install -e ".[train]".
  • Prerequisites: Python 3.10, CUDA (implied by training on A100s), and flash-attn. Data preparation follows Data.md.
  • Resources: Training is resource-intensive: the README cites 16 x A100 (80 GB) GPUs, with pretraining taking roughly 5 hours and visual instruction tuning roughly 20 hours. DeepSpeed ZeRO-3 can reduce per-GPU memory requirements (see the config sketch after this list).
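
As a rough illustration of the ZeRO-3 option, the sketch below writes a minimal DeepSpeed ZeRO-3 configuration file of the kind typically passed to a training launcher via --deepspeed. The keys and the "auto" placeholders are standard DeepSpeed / Hugging Face Trainer conventions, not values taken from this repository.

```python
# Minimal DeepSpeed ZeRO-3 config sketch (assumed, not the repo's shipped file).
# "auto" values are resolved by the Hugging Face Trainer at launch time.
import json

zero3_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,                       # shard optimizer state, gradients, and parameters
        "overlap_comm": True,             # overlap communication with computation
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```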

Highlighted Details

  • Reproduces LLaVA-NeXT results with open-sourced data and checkpoints.
  • Supports models such as LLaVA-NeXT-Vicuna-7B and LLaVA-NeXT-LLaMA3-8B.
  • Uses CLIP-L-336 as the vision tower, paired with various LLMs.
  • Offers options for fine-tuning the vision tower (--unfreeze_mm_vision_tower True) and processing images at variable resolutions (--image_aspect_ratio anyres); see the sketch after this list.
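
As a purely hypothetical illustration of how such flags are usually surfaced, the sketch below defines Hugging Face argument dataclasses whose field names mirror the CLI options above; the repository's real dataclasses may be organized differently.

```python
# Hypothetical sketch of LLaVA-style training flags exposed through
# HfArgumentParser dataclasses; not the repository's actual argument classes.
# Example launch: python train_sketch.py --output_dir ./out \
#   --unfreeze_mm_vision_tower True --image_aspect_ratio anyres
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    vision_tower: str = field(default="openai/clip-vit-large-patch14-336")
    unfreeze_mm_vision_tower: bool = field(
        default=False,
        metadata={"help": "Also fine-tune the vision tower during instruction tuning."},
    )

@dataclass
class DataArguments:
    image_aspect_ratio: str = field(
        default="anyres",
        metadata={"help": "'anyres' tiles variable-resolution images into multiple crops."},
    )

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    print(model_args, data_args)
```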

Maintenance & Community

  • The project is actively maintained by xiaoachen98.
  • Acknowledgments include LLaVA, ShareGPT4V, and VLMEvalKit.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository does not specify a license in the README. The citation entry provides a Zenodo DOI, but a DOI identifies the release rather than its licensing terms, so explicit license details are missing.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README does not explicitly state the license, which could impact commercial use. While it aims for reproducibility, the significant hardware requirements (16x A100 80GB) may limit accessibility for many users.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

23k stars
Top 0.2%
created 2 years ago
updated 11 months ago