BakLLaVA by SkunkworksAI

Multimodal model for visual instruction tuning, enhanced from LLaVA

created 1 year ago
711 stars

Top 49.2% on sourcepulse

Project Summary

BakLLaVA is a project focused on enhancing multimodal capabilities in large language models by improving base models, training processes, datasets, and architectural components. It targets researchers and developers working with vision-language models, offering a framework for visual instruction tuning with GPT-4 level performance.

How It Works

BakLLaVA builds upon the LLaVA architecture, implementing modifications to the base models, training data, and training procedures. It aims to integrate vision and language understanding more effectively, enabling models to follow multimodal instructions and achieve state-of-the-art performance. The project emphasizes custom datasets and architectural changes for improved multimodal reasoning.
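
As a rough illustration of the LLaVA-style integration described above, the sketch below shows the general idea of projecting vision-encoder features into the language model's embedding space. The module name, dimensions, and MLP shape are assumptions for illustration, not BakLLaVA's actual code.

# Schematic sketch of LLaVA-style vision-language integration
# (assumed names and dimensions; not the project's implementation).
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map vision-encoder patch features into the LLM's embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP projector, in the spirit of LLaVA's connector module.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's token embeddings
        image_tokens = self.proj(image_features)
        # Prepend the projected "image tokens" so the LLM attends to image and
        # instruction text as one sequence.
        return torch.cat([image_tokens, text_embeds], dim=1)

# Example shapes only; a real checkpoint supplies the encoders and the LLM.
bridge = VisionToLLMProjector()
fused = bridge(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])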

Quick Start & Requirements

  • Install: pip install -e . within a Python 3.10 conda environment.
  • Prerequisites: Python 3.10, PyTorch, Transformers, Gradio. For training: ninja, flash-attn.
  • Demo: Requires downloading LLaVA checkpoints. Launch the controller (python -m llava.serve.controller), a model worker (python -m llava.serve.model_worker), and the Gradio server (python -m llava.serve.gradio_web_server); a full command sequence is sketched after this list.
  • Quantization: Supports 4-bit and 8-bit inference (--load-4bit, --load-8bit) for reduced VRAM usage (e.g., <8GB VRAM for 7B models).
  • Resources: Training requires significant GPU resources (e.g., 8x A100 80GB). Quantized inference is much lighter, fitting on consumer GPUs with roughly 8-12GB of VRAM depending on model size (see the quantization note above).
  • Links: LLaVA GitHub, Model Zoo
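
Assembled from the steps above, a minimal demo launch might look like the following. The checkpoint path, hosts, and ports are placeholders, and the flag names follow the upstream LLaVA serve scripts, so check them against this repo:

  # 1. Controller that coordinates model workers
  python -m llava.serve.controller --host 0.0.0.0 --port 10000

  # 2. Model worker serving a downloaded checkpoint; add --load-4bit or
  #    --load-8bit for quantized, lower-VRAM inference
  python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 \
      --port 40000 --worker http://localhost:40000 \
      --model-path ./checkpoints/llava-7b --load-4bit

  # 3. Gradio web UI pointed at the controller
  python -m llava.serve.gradio_web_server --controller http://localhost:10000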

Highlighted Details

  • Offers a Gradio Web UI for interactive demos and model comparison.
  • Supports CLI inference for direct image-based chat (example command after this list).
  • Includes detailed scripts and hyperparameters for both pretraining (feature alignment) and visual instruction tuning.
  • Introduces LLaVA-Lightning for significantly faster training cycles (e.g., 3 hours on 8x A100).
  • Provides a GPT-assisted evaluation pipeline for comprehensive model assessment.
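
For the CLI route, a single command chats about an image without launching the web stack. The model path and image file below are placeholders, and the flags mirror upstream LLaVA's llava.serve.cli:

  python -m llava.serve.cli --model-path ./checkpoints/llava-7b \
      --image-file ./images/example.jpg --load-4bit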

Maintenance & Community

  • Collaboration with LAION and Ontocord.
  • Mentions Together Compute as a compute sponsor.
  • Related projects include LLaVA-Med and Otter.

Licensing & Compatibility

  • Data and checkpoints are licensed for research use only.
  • Usage is restricted by the license agreements of LLaMA, Vicuna, and GPT-4.
  • Dataset is CC BY NC 4.0 (non-commercial use).

Limitations & Caveats

  • The data and checkpoints are strictly for research purposes and prohibit commercial use.
  • Quantized inference may result in reduced accuracy compared to full-precision models.
  • Training requires substantial computational resources.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
