Multimodal model for visual instruction tuning, built as an enhanced version of LLaVA
Top 49.2% on sourcepulse
BakLLaVA is a project focused on enhancing multimodal capabilities in large language models by improving the base models, training processes, datasets, and architectural components. It targets researchers and developers working with vision-language models, offering a framework for visual instruction tuning that aims for GPT-4-level instruction-following performance.
How It Works
BakLLaVA builds upon the LLaVA architecture, in which a vision encoder's image features are projected into the embedding space of a large language model, and modifies the base models, training data, and training procedures. The goal is to integrate vision and language understanding more effectively, enabling models to follow multimodal instructions and reach state-of-the-art performance. The project emphasizes custom datasets and architectural changes for improved multimodal reasoning.
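As a rough mental model of this LLaVA-style pipeline, the sketch below shows a small projector mapping vision-encoder patch features into the language model's token space; module names and dimensions are illustrative only, not BakLLaVA's actual code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector in the style of LLaVA-1.5 (illustrative, not BakLLaVA's code)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# Toy shapes: 576 CLIP ViT-L/14-336 patch features (1024-d) projected to a 4096-d LLM space.
connector = VisionLanguageConnector(vision_dim=1024, llm_dim=4096)
visual_tokens = connector(torch.randn(1, 576, 1024))         # (1, 576, 4096)
text_embeddings = torch.randn(1, 32, 4096)                    # embedded instruction tokens
llm_inputs = torch.cat([visual_tokens, text_embeddings], 1)   # sequence the LLM attends over
```

The projected "visual tokens" are simply concatenated with the text tokens, so the language model can attend over both modalities with no change to its core architecture.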
Quick Start & Requirements
- Install with `pip install -e .` within a Python 3.10 conda environment; the additional packages `ninja` and `flash-attn` are needed for training.
- To run the web demo, launch the controller (`python -m llava.serve.controller`), a model worker (`python -m llava.serve.model_worker`), and the Gradio server (`python -m llava.serve.gradio_web_server`).
- Quantized loading flags (`--load-4bit`, `--load-8bit`) reduce VRAM usage (e.g., <8GB VRAM for 7B models); see the sketch after this list.
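For a quick local test without the serving stack, the sketch below loads the model in 4-bit through the Hugging Face `transformers` API, which plays the same role as the `--load-4bit` flag above. The checkpoint name `llava-hf/bakLlava-v1-hf` and the demo image URL are assumptions to verify, and this community port is not the repo's own inference path.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/bakLlava-v1-hf"  # assumed community checkpoint; verify before use
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

# Any RGB image works; this URL is only a placeholder example.
image = Image.open(
    requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw
)
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

This requires `transformers`, `accelerate`, and `bitsandbytes` to be installed alongside the editable install above.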
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats