Visual-CoT by deepcs233

Research paper advancing multimodal language models

created 1 year ago
354 stars

Project Summary

This repository provides the Visual CoT dataset and benchmark for chain-of-thought reasoning in multimodal language models (MLLMs). It targets researchers and developers working with MLLMs and offers 438k question-answer pairs annotated with bounding boxes, which support fine-grained visual understanding and reasoning.

How It Works

The project introduces a multi-turn processing pipeline for MLLMs. The model first identifies the image region relevant to the question as a bounding box, then focuses on that region to produce the answer, so each reasoning step can be inspected. This helps with visual question answering tasks that depend on precisely localized information. The project builds on the LLaVA framework and reuses its architecture for visual-language integration.
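The two-turn flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the repository's implementation: the `predict_box` and `predict_answer` callables are hypothetical stand-ins for model calls, the `(x1, y1, x2, y2)` pixel box convention is assumed, and the image is a toy nested list rather than a real tensor.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed convention: (x1, y1, x2, y2) in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and order its corners."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return x1, y1, x2, y2


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the (clamped) box."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = clamp_box(box, width, height)
    return [row[x1:x2] for row in image[y1:y2]]


def answer_with_visual_cot(
    image: Image,
    question: str,
    predict_box: Callable[[Image, str], Box],            # hypothetical model call
    predict_answer: Callable[[Image, Image, str], str],  # hypothetical model call
) -> Tuple[Box, str]:
    # Turn 1: ask the model which region matters for this question.
    box = predict_box(image, question)
    region = crop(image, box)
    # Turn 2: answer using both the full image and the focused region.
    answer = predict_answer(image, region, question)
    # Returning the box keeps the intermediate reasoning step inspectable.
    return box, answer
```

With stub callables in place of a real model, the function returns both the intermediate box and the final answer, which is what makes the reasoning step visible to the caller.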

Quick Start & Requirements

  • Install: Clone the repository, then run pip install -e . to install the dependencies. Optionally run pip install -e ".[train]" to add the train extra.
  • Prerequisites: Python 3.10; a Conda environment is recommended. Training requires 8x A100 GPUs (80GB).
  • Resources: On 8x A100, the feature alignment stage takes ~5.5 hours for VisCoT-13B, and visual instruction tuning takes ~60 hours for VisCoT-7b-224.
  • Links: Project Page, Dataset, arXiv
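To budget compute, the wall-clock figures above can be converted to GPU-hours. The sketch below is simple arithmetic on the quoted numbers; the helper name `gpu_hours` is made up for illustration, and it assumes all 8 GPUs are busy for the whole run.

```python
def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """GPU-hours consumed, assuming every GPU is used for the full duration."""
    return wall_clock_hours * num_gpus


# Figures quoted above (approximate, 8x A100).
alignment_13b = gpu_hours(5.5, 8)    # feature alignment, VisCoT-13B
tuning_7b_224 = gpu_hours(60.0, 8)   # visual instruction tuning, VisCoT-7b-224

print(f"Feature alignment (VisCoT-13B): ~{alignment_13b:.0f} GPU-hours")
print(f"Instruction tuning (VisCoT-7b-224): ~{tuning_7b_224:.0f} GPU-hours")
```

The two figures refer to different model variants, so they should not be added together as the total cost of training a single model.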

Highlighted Details

  • NeurIPS'24 Spotlight paper.
  • 438k question-answer pairs with intermediate bounding box annotations.
  • Supports multi-turn processing for dynamic visual focus.
  • Benchmark evaluates MLLMs on specific local region identification.
  • Offers pre-trained checkpoints for VisCoT-7B and VisCoT-13B models.
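This summary does not state which metric the benchmark uses to score region identification. Intersection over union (IoU) is a common way to compare a predicted bounding box with an annotated one, so the sketch below shows that computation purely as an illustration. The `(x1, y1, x2, y2)` box format is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A value of 1.0 means the predicted and annotated regions coincide, and 0.0 means they do not overlap at all.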

Maintenance & Community

The project is based on LLaVA and Vicuna. The README gives no further details about community channels or engagement.

Licensing & Compatibility

  • Code License: Apache License 2.0.
  • Model Weights: Use must comply with the license of the base LLM (e.g., Vicuna).

Limitations & Caveats

The dataset and models are intended solely for research purposes and reproducibility. Any deployed or commercial use is out of scope. Some dataset images may require registration or completion of forms for download.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 50 stars in the last 90 days
