LLaVAR by SALT-NLP

Visual instruction tuning code/data for text-rich image understanding

created 2 years ago
269 stars

Top 96.2% on sourcepulse

Project Summary

LLaVAR enhances visual instruction tuning for text-rich image understanding, targeting researchers and developers working with multimodal AI models. It provides code and data to improve Optical Character Recognition (OCR) capabilities in large multimodal models, specifically by building upon the LLaVA architecture.

How It Works

LLaVAR modifies the LLaVA training and serving files to support Vicuna v1.1, using '</s>' as a separator. This adaptation, combined with specific pretraining and finetuning datasets focused on text-rich images, aims to significantly boost OCR performance. The project leverages CLIP's ViT-Large-336 as the vision tower and integrates custom instruction data to improve the model's ability to extract and understand text within images.
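
For intuition, the Vicuna v1.1 conversation format terminates each assistant turn with '</s>'. Below is a minimal sketch of such a prompt builder; the constant and function names are illustrative, not LLaVAR's actual code (its real templates live in the modified LLaVA training/serving files):

```python
# Minimal sketch of a Vicuna v1.1-style prompt with '</s>' as the separator.
# Names here are illustrative; LLaVAR's actual conversation templates are part
# of its modified LLaVA training/serving files.
SYSTEM = ("A chat between a curious human and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the human's questions.")
SEP = "</s>"  # appended after every assistant turn in Vicuna v1.1

def build_prompt(turns):
    """turns: list of (user_message, assistant_message_or_None) tuples."""
    prompt = SYSTEM + " "
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg} ASSISTANT:"
        if assistant_msg is not None:
            prompt += f" {assistant_msg}{SEP}"
    return prompt

print(build_prompt([("What text appears on the sign in the image?", None)]))
```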

Quick Start & Requirements

  • Installation: Follow LLaVA's environment setup and model weight merging instructions.
  • Prerequisites: Requires LLaMA-13B model weights, Vicuna v1.1 compatibility, and the openai/clip-vit-large-patch14-336 vision tower.
  • Data: Download the pretraining and finetuning datasets from Hugging Face (see the download sketch after this list).
  • Training: Uses torchrun with the train_mem.py script and requires significant GPU resources (e.g., 8 GPUs for pretraining).
  • Evaluation: Scripts are provided for COCO evaluation and integration with MultimodalOCR.
  • Links: Project Page, arXiv, Hugging Face Datasets/Weights
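
A minimal sketch of the data-download step, assuming the datasets are hosted as Hugging Face dataset repos (the repo ID below is a placeholder; use the dataset names linked from the project page):

```python
# Hedged sketch: fetch LLaVAR pretraining/finetuning data from Hugging Face.
# The repo_id is a placeholder; substitute the dataset repo linked by the project.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(
    repo_id="<llavar-dataset-repo>",  # placeholder, not a real repo ID
    repo_type="dataset",
    local_dir="./llavar_data",
)
print("Downloaded dataset to:", data_dir)
```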

Highlighted Details

  • Achieves an OCR score of 80 on the MME benchmark, a significant increase from LLaVA's 50.
  • Provides ready-to-use model checkpoints and finetuning datasets on Hugging Face.
  • Includes metadata for LAION images used in pretraining and finetuning.
  • Offers specific scripts for both pretraining and finetuning, along with evaluation tools.

Maintenance & Community

The project is based on LLaVA and acknowledges contributions from the Vicuna and MultimodalOCR projects. Updates are regularly posted on the project page.

Licensing & Compatibility

The codebase is primarily derived from LLaVA. Specific licensing details for LLaVA and Vicuna should be consulted for compatibility, especially for commercial use.

Limitations & Caveats

The project requires merging model weights with LLaMA-13B and relies on the LLaVA codebase, inheriting its dependencies and potential limitations. The training scripts suggest substantial computational resources are necessary.
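
As an illustration of what the weight-merging step generally involves, assuming the checkpoints follow LLaVA's delta format (where the release stores finetuned-minus-base parameters), a conceptual sketch follows. For real checkpoints, use the merge script shipped with the codebase; all paths here are placeholders:

```python
# Conceptual sketch of delta-weight merging (merged = base + delta).
# Paths are placeholders; an actual merge should use the project's own script,
# which also handles vocabulary/embedding size differences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained("path/to/llavar-delta", torch_dtype=torch.float16)

base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state and param.shape == base_state[name].shape:
        param.data += base_state[name]  # delta stores (finetuned - base)

delta.save_pretrained("path/to/llavar-13b-merged")
AutoTokenizer.from_pretrained("path/to/llavar-delta").save_pretrained("path/to/llavar-13b-merged")
```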

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more:

LLaVA by haotian-liu
Multimodal assistant with GPT-4 level capabilities. 23k stars (top 0.3%); created 2 years ago, updated 11 months ago.