Research dataset and benchmark advancing multimodal language models
This repository provides the Visual CoT dataset and benchmark for advancing multimodal language models (MLLMs) in chain-of-thought reasoning. It targets researchers and developers working with MLLMs, offering 438k question-answer pairs annotated with bounding boxes to enable fine-grained visual understanding and reasoning.
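Each annotation pairs a question and answer with the bounding box of the image region that supports the answer. Below is a minimal sketch of reading one record in Python; the file name and field keys are illustrative assumptions, not the dataset's documented schema.

```python
import json

# Load a Visual CoT annotation file (path and schema assumed for illustration).
with open("viscot_annotations.json") as f:
    records = json.load(f)

sample = records[0]
question = sample["question"]      # assumed key: the visual question
answer = sample["answer"]          # assumed key: the ground-truth answer
bbox = sample["bbox"]              # assumed key: [x1, y1, x2, y2] pixel box
image_path = sample["image"]       # assumed key: relative path to the image
print(question, answer, bbox, image_path)
```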
How It Works
The project introduces a multi-turn processing pipeline in which the MLLM first identifies the image region relevant to a question, expressed as a bounding box, and then attends to that region to produce interpretable intermediate reasoning steps before answering. This improves performance on visual question answering tasks that require precise localization of fine-grained information. The accompanying model and training code are built on the LLaVA framework, reusing its visual-language architecture; a sketch of the two-turn flow is shown below.
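The following sketch assumes the model returns pixel bounding-box coordinates as bracketed text; `mllm_generate` and the prompt wording stand in for whatever generation interface the underlying MLLM exposes and are not this repository's actual API.

```python
from PIL import Image

def visual_cot_answer(mllm_generate, image_path, question):
    """Two-turn Visual CoT-style inference sketch (hypothetical interface)."""
    image = Image.open(image_path).convert("RGB")

    # Turn 1: ask the model to localize the region needed to answer.
    bbox_text = mllm_generate(
        images=[image],
        prompt=f"{question}\nAnswer with the bounding box [x1, y1, x2, y2] "
               f"of the region needed to answer the question.",
    )
    # Naive parsing; a real pipeline would validate the model output.
    x1, y1, x2, y2 = (int(float(v)) for v in bbox_text.strip("[] \n").split(","))

    # Turn 2: feed the full image plus the zoomed-in crop back to the model
    # so it can reason over the localized evidence before answering.
    crop = image.crop((x1, y1, x2, y2))
    return mllm_generate(images=[image, crop], prompt=question)
```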
Quick Start & Requirements
Install the package in editable mode with pip install -e . and, for training dependencies, additionally run pip install -e ".[train]"
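After installation, a quick import check can confirm the environment is set up; the module name llava is an assumption based on the upstream LLaVA codebase this project extends.

```python
# Verify the editable install is importable (module name assumed).
import llava
print("Visual CoT codebase importable as:", llava.__name__)
```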
Highlighted Details
Maintenance & Community
The project is based on LLaVA and Vicuna. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The dataset and models are intended solely for research purposes and reproducibility. Any deployed or commercial use is out of scope. Some dataset images may require registration or completion of forms for download.