LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

created 2 years ago
23,171 stars

Top 1.8% on sourcepulse

Project Summary

LLaVA is an open-source project enabling large language and vision assistant capabilities, aiming to match or exceed GPT-4V performance. It's designed for researchers and developers working on multimodal AI, offering a robust framework for visual instruction tuning and a suite of pre-trained models.

How It Works

LLaVA employs visual instruction tuning: a pre-trained vision encoder (CLIP ViT-L/14) is connected to a large language model (e.g., Vicuna) through a trainable projection layer. Training proceeds in two stages: first, with both the encoder and the LLM frozen, only the projection layer is trained to align visual features with the LLM's embedding space; then the projector and the LLM are fine-tuned together on a large dataset of multimodal instruction-following data, enabling the model to understand and respond to visual prompts. The vision encoder stays frozen throughout, which keeps training efficient and yields strong performance with relatively modest computational resources.
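
As a rough illustration, here is a minimal PyTorch sketch of the projector, assuming LLaVA-1.5's two-layer MLP design; the dimensions are the CLIP ViT-L/14-336 and 7B-LLM defaults, and the class name is hypothetical, not taken from the repo's code:

    import torch
    import torch.nn as nn

    class VisionProjector(nn.Module):
        """Illustrative sketch (not the repo's code): a 2-layer MLP that maps
        frozen CLIP patch features into the LLM's token-embedding space."""
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
            # patch_features: (batch, num_patches, vision_dim) from the frozen encoder
            return self.proj(patch_features)

    # The projected "visual tokens" are concatenated with the text embeddings and
    # fed to the LLM; in stage 1, only this projector receives gradient updates.
    visual_tokens = VisionProjector()(torch.randn(1, 576, 1024))  # -> (1, 576, 4096)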

Quick Start & Requirements

  • Install: clone the repository, then pip install -e . (Python 3.10+ recommended).
  • Prerequisites: PyTorch, Transformers, Accelerate, FlashAttention (optional but recommended). GPU with sufficient VRAM (e.g., 12GB for 4-bit 7B models, 24GB+ for larger models).
  • Demo: a Gradio web UI can be launched locally; a scripted-inference sketch follows this list.
  • Docs: LLaVA Project Page, Model Zoo.
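
For scripted inference, a minimal sketch following the Python API documented in the repo's README (the model path, prompt, and image URL are the README's own examples):

    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path
    from llava.eval.run_llava import eval_model

    model_path = "liuhaotian/llava-v1.5-7b"

    # Ask a question about an image using the repo's documented eval helper.
    args = type('Args', (), {
        "model_path": model_path,
        "model_base": None,
        "model_name": get_model_name_from_path(model_path),
        "query": "What are the things I should be cautious about when I visit here?",
        "conv_mode": None,
        "image_file": "https://llava-vl.github.io/static/images/view.jpg",
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
    })()

    eval_model(args)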

Highlighted Details

  • Reported state-of-the-art results at release across benchmarks spanning visual question answering and multimodal reasoning.
  • Supports multiple model sizes (7B, 13B, 34B) and 4-bit/8-bit quantization for reduced VRAM usage; a loading sketch follows this list.
  • Offers LoRA training support for efficient fine-tuning.
  • Recent LLaVA-NeXT versions support Llama-3 and Qwen-1.5, and demonstrate strong zero-shot video capabilities.
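
A minimal sketch of quantized loading via the repo's model builder (load_4bit is a keyword argument of load_pretrained_model in the repo; exact defaults may vary across versions):

    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path

    model_path = "liuhaotian/llava-v1.5-7b"

    # 4-bit loading keeps the 7B model within roughly 12 GB of VRAM
    # (per the requirements listed above).
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
        load_4bit=True,
    )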

Maintenance & Community

The project is maintained by Haotian Liu and collaborators, with significant community contributions including integrations with llama.cpp, AutoGen, and SGLang. Community support is available via Discord and Slack, though upstream commit activity has slowed (see the Health Check below).

Licensing & Compatibility

The LLaVA codebase is released under the Apache 2.0 license, but the base models it builds on (e.g., Llama-2, Vicuna) and its training datasets are subject to their original licenses, which may restrict commercial use or redistribution. Users must comply with all underlying license terms.

Limitations & Caveats

LLaVA-1.5 can be trained on a single 8×A100 node, but pushing toward GPT-4V-level capability with larger models requires substantially more compute. Some community integrations and features may still be in preview or have experimental status.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 913 stars in the last 90 days
