Open-LLaVA-NeXT by xiaoachen98

Open-source code for training LLaVA-NeXT models

created 1 year ago
412 stars

Top 72.0% on sourcepulse

Project Summary

This repository provides an open-source implementation for training LLaVA-NeXT, a large multi-modal model. It aims to reproduce LLaVA-NeXT results and offers all training data and checkpoints for research. The project is suitable for researchers and practitioners in the large multi-modal model community.

How It Works

The training process has two stages: feature alignment and visual instruction tuning. The feature alignment stage connects a frozen vision encoder (CLIP-L-336) to a frozen LLM by training a projector on a subset of the LAION-CC-SBU dataset. The subsequent visual instruction tuning stage fine-tunes the full model on 1 million open-source samples. The implementation is based on the LLaVA codebase with minimal modifications for ease of understanding.
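
To make the two-stage setup concrete, here is a minimal PyTorch sketch of the wiring. It is not the repository's actual code: the checkpoint names, the two-layer MLP projector, and the freezing pattern follow common LLaVA conventions and should be read as assumptions rather than Open-LLaVA-NeXT's exact configuration.

```python
# Minimal sketch (assumed LLaVA-style wiring, not the repo's code) of how a
# frozen CLIP vision tower feeds a frozen LLM through a trainable projector.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # assumed base LLM

# Projector maps vision-encoder hidden states into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
)

# Stage 1 (feature alignment): vision tower and LLM stay frozen, only the
# projector receives gradients.
for p in vision_tower.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Turn an image batch into token embeddings the LLM can consume."""
    patch_features = vision_tower(pixel_values).last_hidden_state  # (B, N, D_vis)
    return projector(patch_features)                               # (B, N, D_llm)

# Stage 2 (visual instruction tuning): unfreeze the LLM (and optionally the
# vision tower, cf. --unfreeze_mm_vision_tower) and train end to end.
for p in llm.parameters():
    p.requires_grad = True
```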

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llava-next python=3.10), activate it, and install the package (pip install -e .). For training, install the additional dependencies with pip install -e ".[train]".
  • Prerequisites: Python 3.10, CUDA (implied by training on A100s), and flash-attn. Data preparation follows Data.md.
  • Resources: Training is resource-intensive: the README cites 16 x A100 (80 GB) GPUs, with pretraining taking roughly 5 hours and visual instruction tuning roughly 20 hours. DeepSpeed ZeRO-3 can reduce per-GPU memory requirements (see the config sketch after this list).
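
As a rough illustration of the ZeRO-3 option, the sketch below writes a minimal DeepSpeed ZeRO-3 configuration file of the kind typically passed to a training launcher via --deepspeed. The keys and the "auto" placeholders are standard DeepSpeed / Hugging Face Trainer conventions, not values taken from this repository.

```python
# Minimal DeepSpeed ZeRO-3 config sketch (assumed, not the repo's shipped file).
# "auto" values are resolved by the Hugging Face Trainer at launch time.
import json

zero3_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,                       # shard optimizer state, gradients, and parameters
        "overlap_comm": True,             # overlap communication with computation
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```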

Highlighted Details

  • Reproduces LLaVA-NeXT results with open-sourced data and checkpoints.
  • Supports models such as LLaVA-NeXT-Vicuna-7B and LLaVA-NeXT-LLaMA3-8B.
  • Uses CLIP-L-336 as the vision tower, paired with various LLMs.
  • Offers options for fine-tuning the vision tower (--unfreeze_mm_vision_tower True) and processing images at variable resolutions (--image_aspect_ratio anyres); see the sketch after this list.
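
As a purely hypothetical illustration of how such flags are usually surfaced, the sketch below defines Hugging Face argument dataclasses whose field names mirror the CLI options above; the repository's real dataclasses may be organized differently.

```python
# Hypothetical sketch of LLaVA-style training flags exposed through
# HfArgumentParser dataclasses; not the repository's actual argument classes.
# Example launch: python train_sketch.py --output_dir ./out \
#   --unfreeze_mm_vision_tower True --image_aspect_ratio anyres
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    vision_tower: str = field(default="openai/clip-vit-large-patch14-336")
    unfreeze_mm_vision_tower: bool = field(
        default=False,
        metadata={"help": "Also fine-tune the vision tower during instruction tuning."},
    )

@dataclass
class DataArguments:
    image_aspect_ratio: str = field(
        default="anyres",
        metadata={"help": "'anyres' tiles variable-resolution images into multiple crops."},
    )

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    print(model_args, data_args)
```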

Maintenance & Community

  • The project is actively maintained by xiaoachen98.
  • Acknowledgments include LLaVA, ShareGPT4V, and VLMEvalKit.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository does not specify a license in the README. The citation entry provides a Zenodo DOI, but a DOI identifies the release rather than its licensing terms, so explicit license details are missing.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README does not explicitly state the license, which could impact commercial use. While it aims for reproducibility, the significant hardware requirements (16x A100 80GB) may limit accessibility for many users.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

23k stars
Top 0.2%
created 2 years ago
updated 11 months ago