Visual instruction tuning code/data for text-rich image understanding
LLaVAR enhances visual instruction tuning for text-rich image understanding, targeting researchers and developers working with multimodal AI models. It provides code and data that improve the Optical Character Recognition (OCR) capabilities of large multimodal models, building directly on the LLaVA architecture.
How It Works
LLaVAR modifies the LLaVA training and serving files to support Vicuna v1.1, using '</s>' as a separator. This adaptation, combined with specific pretraining and finetuning datasets focused on text-rich images, aims to significantly boost OCR performance. The project leverages CLIP's ViT-Large-336 as the vision tower and integrates custom instruction data to improve the model's ability to extract and understand text within images.
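To make these two pieces concrete, the sketch below (illustrative only, not the repository's actual training code) loads openai/clip-vit-large-patch14-336 as the vision tower and formats a conversation in Vicuna v1.1 style, where each assistant turn ends with the '</s>' separator. The helper name build_prompt and the placeholder image are assumptions made for the example.

```python
# Illustrative sketch: CLIP ViT-Large-336 vision tower + Vicuna v1.1-style prompt
# with '</s>' separating assistant turns. Not the repository's actual code.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

VISION_TOWER = "openai/clip-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(VISION_TOWER)
vision_tower = CLIPVisionModel.from_pretrained(VISION_TOWER)

def build_prompt(turns, system="A chat between a curious user and an artificial intelligence assistant."):
    # Vicuna v1.1 convention: "USER: ... ASSISTANT: ...</s>" for each turn.
    prompt = system + " "
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    return prompt

# Extract patch features for one (placeholder) text-rich image.
image = Image.new("RGB", (336, 336))
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)
patch_features = outputs.hidden_states[-2]  # LLaVA-style: take the second-to-last layer
print(patch_features.shape)  # e.g. torch.Size([1, 577, 1024]) for ViT-L/14 at 336px

print(build_prompt([("What does the sign in the image say?", "It says 'OPEN 24 HOURS'.")]))
```

In the full pipeline these patch features are projected into the language model's embedding space; the snippet only illustrates the vision tower and separator conventions described above.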
Quick Start & Requirements
Pretraining and finetuning use openai/clip-vit-large-patch14-336 as the vision tower. Training is launched with torchrun and the train_mem.py script, and requires significant GPU resources (e.g., 8 GPUs for pretraining).
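As a minimal pre-flight check (an illustrative assumption, not a script shipped with the repository), the snippet below confirms that GPUs are visible and caches the vision tower weights before a torchrun launch of train_mem.py:

```python
# Pre-flight check before launching torchrun + train_mem.py (illustrative only).
import torch
from transformers import CLIPVisionModel

assert torch.cuda.is_available(), "CUDA GPUs are required for pretraining/finetuning."
print(f"Visible GPUs: {torch.cuda.device_count()}")  # pretraining expects e.g. 8 GPUs

# Download and cache the vision tower so the distributed job does not stall on it.
CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
```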
Maintenance & Community
The project is based on LLaVA and acknowledges contributions from the Vicuna and MultimodalOCR projects. Updates are regularly posted on the project page.
Licensing & Compatibility
The codebase is primarily derived from LLaVA. Specific licensing details for LLaVA and Vicuna should be consulted for compatibility, especially for commercial use.
Limitations & Caveats
The project requires merging model weights with LLaMA-13B and relies on the LLaVA codebase, inheriting its dependencies and potential limitations. The training scripts suggest substantial computational resources are necessary.
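If the released checkpoints follow the LLaVA/Vicuna convention of publishing delta weights, merging them onto a base LLaMA-13B looks roughly like the sketch below. The paths, the generic AutoModelForCausalLM class, and the element-wise delta addition are assumptions for illustration; the repository's own conversion scripts are the authoritative procedure.

```python
# Illustrative LLaVA/Vicuna-style delta merge onto LLaMA-13B. Paths are placeholders;
# prefer the repository's own conversion script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "path/to/llama-13b"        # original LLaMA-13B weights, obtained separately
DELTA = "path/to/llavar-delta"    # hypothetical delta checkpoint
TARGET = "path/to/llavar-merged"  # output directory for merged weights

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta = AutoModelForCausalLM.from_pretrained(DELTA, torch_dtype=torch.float16, low_cpu_mem_usage=True)

# Add each base tensor onto the matching delta tensor (merged = base + delta);
# tensors without a matching base entry (e.g. new projector layers) are kept as-is.
base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state and param.shape == base_state[name].shape:
        param.data += base_state[name]

delta.save_pretrained(TARGET)
AutoTokenizer.from_pretrained(DELTA).save_pretrained(TARGET)
```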