BLIVA by mlpc-ucsd

Multimodal LLM for text-rich visual question answering (AAAI 2024 paper)

Created 2 years ago
261 stars

Top 97.5% on SourcePulse

Project Summary

BLIVA is a multimodal large language model designed to excel at visual question answering, particularly for images containing significant amounts of text. It targets researchers and developers working with visual understanding tasks, offering improved performance on text-rich benchmarks.

How It Works

BLIVA builds upon the BLIP-2/InstructBLIP design, pairing a frozen visual encoder with a language model. Rather than relying on learned query embeddings alone, it also feeds the LLM patch embeddings projected directly from the vision encoder, so fine-grained detail (including text embedded in the image) is not lost in the query bottleneck. Combined with instruction tuning on text-rich visual question answering data, this lets BLIVA extract and interpret textual information within images more accurately than general-purpose multimodal models.
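The general idea can be shown in a minimal PyTorch sketch. This is not the repository's actual code; the module names (vision_encoder, qformer, patch_proj, etc.) and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn


class BlivaStyleAssistant(nn.Module):
    """Sketch only: feed the LLM both Q-Former query embeddings and
    directly projected patch embeddings from the vision encoder."""

    def __init__(self, vision_encoder, qformer, llm,
                 vis_dim=1408, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen ViT, outputs patch features
        self.qformer = qformer                    # produces a fixed set of query embeddings
        self.llm = llm                            # language model (Vicuna / FlanT5 style)
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # query embeddings -> LLM hidden size
        self.patch_proj = nn.Linear(vis_dim, llm_dim)      # raw patch features -> LLM hidden size

    def forward(self, image, text_embeds):
        patch_feats = self.vision_encoder(image)          # (B, num_patches, vis_dim)
        query_embeds = self.qformer(patch_feats)          # (B, num_queries, qformer_dim)
        visual_tokens = torch.cat(
            [self.query_proj(query_embeds),   # compact, instruction-aware summary tokens
             self.patch_proj(patch_feats)],   # fine-grained patch tokens (helps with in-image text)
            dim=1,
        )
        # Prepend visual tokens to the text embeddings and run the LLM on the combined sequence.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```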

Quick Start & Requirements

  • Install: clone the repository, then run pip install -e . inside it (see the inference sketch after this list).
  • Prerequisites: Python 3.9 and PyTorch. The Vicuna version requires downloading Vicuna-7B weights separately and specifying their paths; the FlanT5 version downloads its weights automatically.
  • Resources: Training requires 8x A6000 Ada GPUs. Inference is less demanding.
  • Links: Demo, Vicuna Weights, FlanT5 Weights, Paper.
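The steps above compress into a short Python sketch. BLIVA is built as a LAVIS-style codebase, so the loader interface below (load_model_and_preprocess, the "bliva_vicuna" model name, and the prompt dictionary passed to generate) is an assumption based on that convention rather than a confirmed API; check the repository's README for the exact entry points.

```python
# Illustrative only: assumes a LAVIS-style loader; entry points may differ in the actual repo.
# Setup (shell): clone the mlpc-ucsd/BLIVA repository, then run `pip install -e .` inside it.
import torch
from PIL import Image

from bliva.models import load_model_and_preprocess  # assumed LAVIS-style helper

device = "cuda" if torch.cuda.is_available() else "cpu"

# "bliva_vicuna" / "vicuna7b" are placeholder identifiers. The FlanT5 variant downloads its
# weights automatically; Vicuna-7B weights must be obtained separately and their paths
# configured before loading.
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

image = Image.open("poster.jpg").convert("RGB")          # any text-rich image
image_tensor = vis_processors["eval"](image).unsqueeze(0).to(device)

answer = model.generate({"image": image_tensor,
                         "prompt": "What does the text in the image say?"})
print(answer)
```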

Highlighted Details

  • Achieved No.1 in Color, Poster, and Commonsense Reasoning subtasks on the MME benchmark.
  • Outperforms strong baselines like InstructBLIP and LLaVA on text-rich VQA benchmarks.
  • Offers both Vicuna-7B and FlanT5-XXL versions.
  • Released a new dataset, YTTB-VQA, for text-rich visual question answering.

Maintenance & Community

The project is associated with UC San Diego and Coinbase. Key updates are announced via GitHub releases.

Licensing & Compatibility

  • Code: BSD 3-Clause License.
  • Vicuna Model Weights: Subject to LLaMA's model license.
  • FlanT5 Model Weights: Apache 2.0 License (available for commercial use).
  • YTTB-VQA Data: CC BY-NC 4.0.

Limitations & Caveats

The Vicuna version's commercial use is restricted by LLaMA's license. While strong on text-rich VQA, performance on general VQA benchmarks may vary.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • 0% · 373 stars
  • Multimodal framework for vision-and-language transformer research
  • Created 3 years ago · Updated 2 years ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
  • 0.1% · 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago · Updated 6 months ago