Multimodal model for visual instruction tuning, built as an enhanced version of LLaVA
Top 49.2% on sourcepulse
BakLLaVA is a project focused on enhancing multimodal capabilities in large language models by improving the base models, training processes, datasets, and architectural components. It targets researchers and developers working with vision-language models, offering a framework for visual instruction tuning that aims for GPT-4-level instruction-following performance.
How It Works
BakLLaVA builds upon the LLaVA architecture, in which a vision encoder's image features are projected into the embedding space of a large language model, and modifies the base models, training data, and training procedures. The goal is to integrate vision and language understanding more effectively, enabling models to follow multimodal instructions and reach state-of-the-art performance. The project emphasizes custom datasets and architectural changes for improved multimodal reasoning.
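As a rough mental model of this LLaVA-style pipeline, the sketch below shows a small projector mapping vision-encoder patch features into the language model's token space; module names and dimensions are illustrative only, not BakLLaVA's actual code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector in the style of LLaVA-1.5 (illustrative, not BakLLaVA's code)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# Toy shapes: 576 CLIP ViT-L/14-336 patch features (1024-d) projected to a 4096-d LLM space.
connector = VisionLanguageConnector(vision_dim=1024, llm_dim=4096)
visual_tokens = connector(torch.randn(1, 576, 1024))         # (1, 576, 4096)
text_embeddings = torch.randn(1, 32, 4096)                    # embedded instruction tokens
llm_inputs = torch.cat([visual_tokens, text_embeddings], 1)   # sequence the LLM attends over
```

The projected "visual tokens" are simply concatenated with the text tokens, so the language model can attend over both modalities with no change to its core architecture.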
Quick Start & Requirements
- Install with `pip install -e .` within a Python 3.10 conda environment; the additional packages `ninja` and `flash-attn` are needed for training.
- To run the web demo, launch the controller (`python -m llava.serve.controller`), a model worker (`python -m llava.serve.model_worker`), and the Gradio server (`python -m llava.serve.gradio_web_server`).
- Quantized loading flags (`--load-4bit`, `--load-8bit`) reduce VRAM usage (e.g., <8GB VRAM for 7B models); see the sketch after this list.
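For a quick local test without the serving stack, the sketch below loads the model in 4-bit through the Hugging Face `transformers` API, which plays the same role as the `--load-4bit` flag above. The checkpoint name `llava-hf/bakLlava-v1-hf` and the demo image URL are assumptions to verify, and this community port is not the repo's own inference path.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/bakLlava-v1-hf"  # assumed community checkpoint; verify before use
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

# Any RGB image works; this URL is only a placeholder example.
image = Image.open(
    requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw
)
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

This requires `transformers`, `accelerate`, and `bitsandbytes` to be installed alongside the editable install above.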
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats