LLaVA-CoT by PKU-YuanGroup

Vision-language model (VLM) research project for step-by-step reasoning

created 8 months ago
2,043 stars

Top 22.2% on sourcepulse

Project Summary

LLaVA-CoT is a visual language model designed for spontaneous, systematic reasoning, targeting researchers and developers working with multimodal AI. It aims to improve the explainability and accuracy of visual question answering by enabling models to break down complex problems into logical steps, outperforming leading models on several challenging benchmarks.

How It Works

LLaVA-CoT employs a "chain-of-thought" (CoT) approach integrated into a vision-language model architecture. It processes visual input by first outlining the problem, then generating a detailed caption of relevant image elements, followed by a step-by-step reasoning process, and finally concluding with an answer. This structured approach allows the model to articulate its problem-solving methodology, enhancing transparency and potentially improving accuracy on tasks requiring complex inference.
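
As a rough illustration of this staged output, the sketch below splits a response into its four sections. The stage tags (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>`) follow the paper's description of the output format, and the helper `split_stages` plus the demo string are hypothetical names and data used here for illustration only.

```python
import re

# Stage tags as described in the paper; verify against the model's actual output.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def split_stages(response: str) -> dict:
    """Split a staged LLaVA-CoT style response into its four sections."""
    sections = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        sections[stage.lower()] = match.group(1).strip() if match else None
    return sections

# Example with a hand-written (truncated) staged response.
demo = (
    "<SUMMARY>I will count the objects in the image.</SUMMARY>"
    "<CAPTION>The image shows three apples on a table.</CAPTION>"
    "<REASONING>Each apple is a distinct object, so there are three objects.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
print(split_stages(demo)["conclusion"])  # -> "3"
```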

Quick Start & Requirements

  • Inference: Use the same code as Llama-3.2-11B-Vision-Instruct. Pretrained weights are available on Hugging Face (Xkev/Llama-3.2V-11B-cot); a minimal usage sketch follows this list.
  • Finetuning: Requires llama-recipes and torchrun. Example command provided for finetuning with a custom dataset.
  • Dependencies: Python, PyTorch. Specific versions not detailed but implied by llama-recipes.
  • Resources: Finetuning requires significant computational resources (e.g., 8 GPUs mentioned in the example command).
  • Links: Hugging Face Demo, Hugging Face Model, Dataset, Paper.
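
Since LLaVA-CoT reuses the Llama-3.2-11B-Vision-Instruct inference path, a standard transformers generation loop should work. The sketch below is a minimal, hedged example assuming a transformers version with Mllama support; the image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# LLaVA-CoT weights on Hugging Face (same usage as Llama-3.2-11B-Vision-Instruct).
model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and question.
image = Image.open("example.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many objects are in the image? Reason step by step."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generous token budget so the full multi-stage answer fits.
output = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```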

Highlighted Details

  • Outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.
  • Provides full training code and a generated dataset (LLaVA-CoT-100k).
  • Demonstrates step-by-step reasoning for both visual and science-based problems.
  • Inference script updated so that it no longer relies on VLMEvalKit.

Maintenance & Community

  • Active development with recent updates to inference scripts and release of training code.
  • Project initiated by PKU-YuanGroup.
  • Links to Hugging Face, arXiv, and X (formerly Twitter) for updates and demos.

Licensing & Compatibility

  • Code is released under Apache 2.0 license.
  • The hosted service is a research preview for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and OpenAI's Terms of Use.

Limitations & Caveats

The service is explicitly a research preview for non-commercial use only, with usage subject to the Llama 3.2 and OpenAI terms. A previously noted oversight in the AI2D benchmark testing methodology has been corrected.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 81 stars in the last 90 days
