Multimodal assistant with GPT-4 level capabilities
Top 1.8% on sourcepulse
LLaVA is an open-source project enabling large language and vision assistant capabilities, aiming to match or exceed GPT-4V performance. It's designed for researchers and developers working on multimodal AI, offering a robust framework for visual instruction tuning and a suite of pre-trained models.
How It Works
LLaVA follows a visual instruction tuning recipe: a frozen vision encoder (such as CLIP ViT-L) is connected to a large language model (such as Vicuna) through a trainable projection layer. Training proceeds in two stages: first, only the projection layer is trained to align image features with the LLM's word embedding space; then the projection layer and the LLM are fine-tuned together on a large dataset of multimodal instruction-following data, enabling the model to understand and respond to visual prompts. This recipe trains efficiently and achieves strong performance with relatively modest computational resources.
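To make the recipe concrete, the sketch below illustrates the core idea in plain PyTorch. It is a simplified stand-in rather than the actual LLaVA code; the dimensions assume CLIP ViT-L/14 at 336px as the vision tower and a 7B LLM, and the real implementation splices image tokens into the prompt according to its conversation template.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (CLIP ViT-L/14 at 336px feeding a 7B LLM):
VISION_DIM = 1024   # width of CLIP patch features
LLM_DIM = 4096      # hidden size of the 7B language model
NUM_PATCHES = 576   # (336 / 14)^2 = 24 * 24 patch tokens per image

class Projector(nn.Module):
    """Two-layer MLP that maps frozen vision features into the LLM embedding
    space (LLaVA-1.5 style; the original LLaVA used a single linear layer)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vision_feats)

# Stand-ins for the real models: in LLaVA the vision tower is a frozen CLIP
# encoder and the LLM is Vicuna/Llama; here we only mimic their output shapes.
batch = 2
patch_features = torch.randn(batch, NUM_PATCHES, VISION_DIM)  # frozen CLIP output
text_embeddings = torch.randn(batch, 32, LLM_DIM)             # embedded text prompt

projector = Projector(VISION_DIM, LLM_DIM)
image_tokens = projector(patch_features)                      # (2, 576, 4096)

# The projected image tokens are placed into the LLM's input sequence next to
# the text tokens, and the LLM generates the response autoregressively.
llm_inputs = torch.cat([image_tokens, text_embeddings], dim=1)  # (2, 608, 4096)
print(llm_inputs.shape)
```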
Quick Start & Requirements
git clone https://github.com/haotian-liu/LLaVA && cd LLaVA
pip install -e .
(Python 3.10+ is recommended.)
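After installation, programmatic inference can look like the sketch below. The entry points (llava.eval.run_llava.eval_model, llava.mm_utils.get_model_name_from_path) and the argument names follow the upstream README at the time of writing and may change; running a 7B checkpoint also assumes a GPU with enough memory or 4-/8-bit quantization.

```python
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "liuhaotian/llava-v1.5-7b"

# eval_model expects a simple namespace-like object with these fields;
# the prompt and image URL are illustrative.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "What is shown in this image?",
    "conv_mode": None,
    "image_file": "https://llava-vl.github.io/static/images/view.jpg",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```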
Highlighted Details
Maintenance & Community
The project is actively maintained by Haotian Liu and collaborators, with significant community contributions including integrations with llama.cpp, AutoGen, and SGLang. Active community support is available via Discord/Slack channels.
Licensing & Compatibility
The LLaVA codebase is released under the permissive Apache 2.0 license, but the base models it builds on (e.g., Llama-2, Vicuna) and the training datasets are governed by their own licenses and terms of use, which may restrict commercial use or redistribution. Users must comply with all underlying license terms.
Limitations & Caveats
While LLaVA-1.5 can be trained on a single node with 8 A100 GPUs, approaching GPT-4V-level capability with larger models still requires significant computational resources. Some community integrations and specific features may be in preview or experimental status.