LLaVA-CoT by PKU-YuanGroup

Vision-language model (VLM) research project for step-by-step reasoning

created 8 months ago
2,043 stars

Top 22.2% on sourcepulse

Project Summary

LLaVA-CoT is a visual language model designed for spontaneous, systematic reasoning, targeting researchers and developers working with multimodal AI. It aims to improve the explainability and accuracy of visual question answering by enabling models to break down complex problems into logical steps, outperforming leading models on several challenging benchmarks.

How It Works

LLaVA-CoT employs a "chain-of-thought" (CoT) approach integrated into a vision-language model architecture. It processes visual input by first outlining the problem, then generating a detailed caption of relevant image elements, followed by a step-by-step reasoning process, and finally concluding with an answer. This structured approach allows the model to articulate its problem-solving methodology, enhancing transparency and potentially improving accuracy on tasks requiring complex inference.
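
As a rough illustration of this staged output, the sketch below splits a response into its four sections. The stage tags (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>`) follow the paper's description of the output format, and the helper `split_stages` plus the demo string are hypothetical names and data used here for illustration only.

```python
import re

# Stage tags as described in the paper; verify against the model's actual output.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def split_stages(response: str) -> dict:
    """Split a staged LLaVA-CoT style response into its four sections."""
    sections = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        sections[stage.lower()] = match.group(1).strip() if match else None
    return sections

# Example with a hand-written (truncated) staged response.
demo = (
    "<SUMMARY>I will count the objects in the image.</SUMMARY>"
    "<CAPTION>The image shows three apples on a table.</CAPTION>"
    "<REASONING>Each apple is a distinct object, so there are three objects.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
print(split_stages(demo)["conclusion"])  # -> "3"
```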

Quick Start & Requirements

  • Inference: Use the same code as Llama-3.2-11B-Vision-Instruct. Pretrained weights are available on Hugging Face (Xkev/Llama-3.2V-11B-cot); a minimal usage sketch follows this list.
  • Finetuning: Requires llama-recipes and torchrun. Example command provided for finetuning with a custom dataset.
  • Dependencies: Python, PyTorch. Specific versions not detailed but implied by llama-recipes.
  • Resources: Finetuning requires significant computational resources (e.g., 8 GPUs mentioned in the example command).
  • Links: Hugging Face Demo, Hugging Face Model, Dataset, Paper.
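
Since LLaVA-CoT reuses the Llama-3.2-11B-Vision-Instruct inference path, a standard transformers generation loop should work. The sketch below is a minimal, hedged example assuming a transformers version with Mllama support; the image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# LLaVA-CoT weights on Hugging Face (same usage as Llama-3.2-11B-Vision-Instruct).
model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and question.
image = Image.open("example.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many objects are in the image? Reason step by step."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generous token budget so the full multi-stage answer fits.
output = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```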

Highlighted Details

  • Outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.
  • Provides full training code and a generated dataset (LLaVA-CoT-100k).
  • Demonstrates step-by-step reasoning for both visual and science-based problems.
  • Inference script updated so that it no longer relies on VLMEvalKit.

Maintenance & Community

  • Active development with recent updates to inference scripts and release of training code.
  • Project initiated by PKU-YuanGroup.
  • Links to Hugging Face, arXiv, and X (formerly Twitter) for updates and demos.

Licensing & Compatibility

  • Code is released under Apache 2.0 license.
  • The hosted service is a research preview for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and OpenAI's Terms of Use.

Limitations & Caveats

The service is explicitly a research preview for non-commercial use only, with usage subject to the Llama 3.2 and OpenAI terms. A previously noted oversight in the AI2D benchmark testing methodology has been corrected.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 81 stars in the last 90 days
