BLIVA by mlpc-ucsd

Multimodal LLM for text-rich visual question answering (AAAI 2024 paper)

Created 2 years ago
261 stars

Top 97.5% on SourcePulse

Project Summary

BLIVA is a multimodal large language model designed to excel at visual question answering, particularly for images containing significant amounts of text. It targets researchers and developers working with visual understanding tasks, offering improved performance on text-rich benchmarks.

How It Works

BLIVA builds upon the BLIP-2/InstructBLIP design, pairing a frozen visual encoder with a language model. Rather than relying on learned query embeddings alone, it also feeds the LLM patch embeddings projected directly from the vision encoder, so fine-grained detail (including text embedded in the image) is not lost in the query bottleneck. Combined with instruction tuning on text-rich visual question answering data, this lets BLIVA extract and interpret textual information within images more accurately than general-purpose multimodal models.
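The general idea can be shown in a minimal PyTorch sketch. This is not the repository's actual code; the module names (vision_encoder, qformer, patch_proj, etc.) and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn


class BlivaStyleAssistant(nn.Module):
    """Sketch only: feed the LLM both Q-Former query embeddings and
    directly projected patch embeddings from the vision encoder."""

    def __init__(self, vision_encoder, qformer, llm,
                 vis_dim=1408, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen ViT, outputs patch features
        self.qformer = qformer                    # produces a fixed set of query embeddings
        self.llm = llm                            # language model (Vicuna / FlanT5 style)
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # query embeddings -> LLM hidden size
        self.patch_proj = nn.Linear(vis_dim, llm_dim)      # raw patch features -> LLM hidden size

    def forward(self, image, text_embeds):
        patch_feats = self.vision_encoder(image)          # (B, num_patches, vis_dim)
        query_embeds = self.qformer(patch_feats)          # (B, num_queries, qformer_dim)
        visual_tokens = torch.cat(
            [self.query_proj(query_embeds),   # compact, instruction-aware summary tokens
             self.patch_proj(patch_feats)],   # fine-grained patch tokens (helps with in-image text)
            dim=1,
        )
        # Prepend visual tokens to the text embeddings and run the LLM on the combined sequence.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```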

Quick Start & Requirements

  • Install: clone the repository, then run pip install -e . inside it (see the inference sketch after this list).
  • Prerequisites: Python 3.9 and PyTorch. The Vicuna version requires downloading Vicuna-7B weights separately and specifying their paths; the FlanT5 version downloads its weights automatically.
  • Resources: Training requires 8x A6000 Ada GPUs. Inference is less demanding.
  • Links: Demo, Vicuna Weights, FlanT5 Weights, Paper.
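The steps above compress into a short Python sketch. BLIVA is built as a LAVIS-style codebase, so the loader interface below (load_model_and_preprocess, the "bliva_vicuna" model name, and the prompt dictionary passed to generate) is an assumption based on that convention rather than a confirmed API; check the repository's README for the exact entry points.

```python
# Illustrative only: assumes a LAVIS-style loader; entry points may differ in the actual repo.
# Setup (shell): clone the mlpc-ucsd/BLIVA repository, then run `pip install -e .` inside it.
import torch
from PIL import Image

from bliva.models import load_model_and_preprocess  # assumed LAVIS-style helper

device = "cuda" if torch.cuda.is_available() else "cpu"

# "bliva_vicuna" / "vicuna7b" are placeholder identifiers. The FlanT5 variant downloads its
# weights automatically; Vicuna-7B weights must be obtained separately and their paths
# configured before loading.
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

image = Image.open("poster.jpg").convert("RGB")          # any text-rich image
image_tensor = vis_processors["eval"](image).unsqueeze(0).to(device)

answer = model.generate({"image": image_tensor,
                         "prompt": "What does the text in the image say?"})
print(answer)
```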

Highlighted Details

  • Achieved No.1 in Color, Poster, and Commonsense Reasoning subtasks on the MME benchmark.
  • Outperforms strong baselines like InstructBLIP and LLaVA on text-rich VQA benchmarks.
  • Offers both Vicuna-7B and FlanT5-XXL versions.
  • Released a new dataset, YTTB-VQA, for text-rich visual question answering.

Maintenance & Community

The project is associated with UC San Diego and Coinbase. Key updates are announced via GitHub releases.

Licensing & Compatibility

  • Code: BSD 3-Clause License.
  • Vicuna Model Weights: Subject to LLaMA's model license.
  • FlanT5 Model Weights: Apache 2.0 License (available for commercial use).
  • YTTB-VQA Data: CC BY-NC 4.0.

Limitations & Caveats

The Vicuna version's commercial use is restricted by LLaMA's license. While strong on text-rich VQA, performance on general VQA benchmarks may vary.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • 0% · 373 stars
  • Multimodal framework for vision-and-language transformer research
  • Created 3 years ago · Updated 2 years ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai
  • 0.1% · 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago · Updated 6 months ago