Visual instruction tuning code/data for text-rich image understanding
LLaVAR enhances visual instruction tuning for text-rich image understanding, targeting researchers and developers working with multimodal AI models. It provides code and data that improve the Optical Character Recognition (OCR) capabilities of large multimodal models, building directly on the LLaVA architecture.
How It Works
LLaVAR modifies the LLaVA training and serving files to support Vicuna v1.1, using '</s>' as a separator. This adaptation, combined with specific pretraining and finetuning datasets focused on text-rich images, aims to significantly boost OCR performance. The project leverages CLIP's ViT-Large-336 as the vision tower and integrates custom instruction data to improve the model's ability to extract and understand text within images.
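To make these two pieces concrete, the sketch below (illustrative only, not the repository's actual training code) loads openai/clip-vit-large-patch14-336 as the vision tower and formats a conversation in Vicuna v1.1 style, where each assistant turn ends with the '</s>' separator. The helper name build_prompt and the placeholder image are assumptions made for the example.

```python
# Illustrative sketch: CLIP ViT-Large-336 vision tower + Vicuna v1.1-style prompt
# with '</s>' separating assistant turns. Not the repository's actual code.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

VISION_TOWER = "openai/clip-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(VISION_TOWER)
vision_tower = CLIPVisionModel.from_pretrained(VISION_TOWER)

def build_prompt(turns, system="A chat between a curious user and an artificial intelligence assistant."):
    # Vicuna v1.1 convention: "USER: ... ASSISTANT: ...</s>" for each turn.
    prompt = system + " "
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    return prompt

# Extract patch features for one (placeholder) text-rich image.
image = Image.new("RGB", (336, 336))
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)
patch_features = outputs.hidden_states[-2]  # LLaVA-style: take the second-to-last layer
print(patch_features.shape)  # e.g. torch.Size([1, 577, 1024]) for ViT-L/14 at 336px

print(build_prompt([("What does the sign in the image say?", "It says 'OPEN 24 HOURS'.")]))
```

In the full pipeline these patch features are projected into the language model's embedding space; the snippet only illustrates the vision tower and separator conventions described above.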
Quick Start & Requirements
Pretraining and finetuning use openai/clip-vit-large-patch14-336 as the vision tower. Training is launched with torchrun and the train_mem.py script, and requires significant GPU resources (e.g., 8 GPUs for pretraining).
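As a minimal pre-flight check (an illustrative assumption, not a script shipped with the repository), the snippet below confirms that GPUs are visible and caches the vision tower weights before a torchrun launch of train_mem.py:

```python
# Pre-flight check before launching torchrun + train_mem.py (illustrative only).
import torch
from transformers import CLIPVisionModel

assert torch.cuda.is_available(), "CUDA GPUs are required for pretraining/finetuning."
print(f"Visible GPUs: {torch.cuda.device_count()}")  # pretraining expects e.g. 8 GPUs

# Download and cache the vision tower so the distributed job does not stall on it.
CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
```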
Maintenance & Community
The project is based on LLaVA and acknowledges contributions from the Vicuna and MultimodalOCR projects. Updates are regularly posted on the project page.
Licensing & Compatibility
The codebase is primarily derived from LLaVA. Specific licensing details for LLaVA and Vicuna should be consulted for compatibility, especially for commercial use.
Limitations & Caveats
The project requires merging model weights with LLaMA-13B and relies on the LLaVA codebase, inheriting its dependencies and potential limitations. The training scripts suggest substantial computational resources are necessary.
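If the released checkpoints follow the LLaVA/Vicuna convention of publishing delta weights, merging them onto a base LLaMA-13B looks roughly like the sketch below. The paths, the generic AutoModelForCausalLM class, and the element-wise delta addition are assumptions for illustration; the repository's own conversion scripts are the authoritative procedure.

```python
# Illustrative LLaVA/Vicuna-style delta merge onto LLaMA-13B. Paths are placeholders;
# prefer the repository's own conversion script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "path/to/llama-13b"        # original LLaMA-13B weights, obtained separately
DELTA = "path/to/llavar-delta"    # hypothetical delta checkpoint
TARGET = "path/to/llavar-merged"  # output directory for merged weights

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta = AutoModelForCausalLM.from_pretrained(DELTA, torch_dtype=torch.float16, low_cpu_mem_usage=True)

# Add each base tensor onto the matching delta tensor (merged = base + delta);
# tensors without a matching base entry (e.g. new projector layers) are kept as-is.
base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state and param.shape == base_state[name].shape:
        param.data += base_state[name]

delta.save_pretrained(TARGET)
AutoTokenizer.from_pretrained(DELTA).save_pretrained(TARGET)
```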