LlamaV-o1 by mbzuai-oryx

Multimodal model for step-by-step visual reasoning

created 6 months ago
304 stars

Top 88.9% on sourcepulse

Project Summary

LlamaV-o1 is a large multimodal model designed for step-by-step visual reasoning, targeting researchers and developers in computer vision and natural language processing. It aims to improve the accuracy and logical coherence of large multimodal models on complex visual tasks by introducing a novel benchmark and training methodology.

How It Works

LlamaV-o1 combines multi-step curriculum learning with beam search. Curriculum learning organizes training from simpler to progressively more complex reasoning tasks, supporting incremental skill development, while beam search explores multiple candidate reasoning paths at inference time and keeps the highest-scoring ones. Together, these let the model tackle intricate visual reasoning tasks more effectively than standard single-pass approaches.
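As a rough illustration of the inference-time half of this strategy, the sketch below implements generic step-wise beam search over candidate reasoning steps. The generate_candidates and score callables are hypothetical placeholders standing in for the model's step proposer and path scorer; they are not part of the project's API.

    from typing import Callable, List, Tuple

    def beam_search_steps(
        prompt: str,
        generate_candidates: Callable[[str], List[str]],  # hypothetical: propose next reasoning steps
        score: Callable[[str], float],                    # hypothetical: score a partial reasoning path
        beam_width: int = 4,
        max_steps: int = 6,
    ) -> str:
        """Keep the beam_width best partial reasoning paths, extend each
        with candidate next steps, and repeat for max_steps rounds."""
        beams: List[Tuple[float, str]] = [(0.0, prompt)]
        for _ in range(max_steps):
            expanded: List[Tuple[float, str]] = []
            for path_score, path in beams:
                for step in generate_candidates(path):
                    new_path = path + "\n" + step
                    expanded.append((path_score + score(new_path), new_path))
            if not expanded:  # no candidates proposed; stop early
                break
            # Prune: keep only the highest-scoring partial paths.
            beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
        return max(beams, key=lambda b: b[0])[1]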

Quick Start & Requirements

  • Install: Use the Hugging Face transformers library.
  • Prerequisites: Python, transformers, torch. Hardware requirements are not stated explicitly, but a CUDA-capable GPU is implied for practical inference.
  • Resources: Model weights and the VRC-Bench dataset are available on Hugging Face. Sample inference code is provided (a hedged sketch follows this list).
  • Links: Model Weights, VRC-Bench Dataset, Technical Report.
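A minimal inference sketch using transformers, assuming the checkpoint id omkarthawakar/LlamaV-o1 published on Hugging Face and the standard Llama-3.2-Vision processing flow; consult the repo's sample code for the exact prompt format the model was trained on.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, MllamaForConditionalGeneration

    model_id = "omkarthawakar/LlamaV-o1"  # assumed Hugging Face checkpoint id
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("example.jpg")  # any local test image
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Explain, step by step, what is happening in this image."},
        ]},
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0], skip_special_tokens=True))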

Highlighted Details

  • Outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llava-CoT on six multimodal benchmarks (MMStar, MMBench, MMVet, MathVista, AI2D, Hallusion), achieving a 3.8% average score improvement over Llava-CoT.
  • Introduces VRC-Bench, a novel benchmark for evaluating multi-step visual reasoning across eight diverse categories with over 1,000 samples.
  • Features a new metric that assesses reasoning quality at the individual step level, emphasizing correctness and logical coherence (a toy illustration follows this list).
  • Achieves 5x faster inference compared to Llava-CoT on complex reasoning tasks.
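The step-level metric itself is judge-model based; purely as a toy illustration of scoring each step rather than only the final answer, the sketch below averages per-step correctness and coherence flags. The StepJudgement structure and equal weighting are assumptions, not the benchmark's actual formula.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StepJudgement:
        correct: bool    # does the step state a factually correct claim?
        coherent: bool   # does it follow logically from prior steps?

    def reasoning_score(steps: List[StepJudgement]) -> float:
        """Toy aggregate: mean of per-step correctness/coherence.
        The real VRC-Bench metric relies on a judge model; this only
        illustrates step-level (vs. answer-only) evaluation."""
        if not steps:
            return 0.0
        per_step = [0.5 * s.correct + 0.5 * s.coherent for s in steps]
        return sum(per_step) / len(per_step)

    # Example: three steps, one of them incoherent -> 0.833...
    print(reasoning_score([
        StepJudgement(correct=True, coherent=True),
        StepJudgement(correct=True, coherent=False),
        StepJudgement(correct=True, coherent=True),
    ]))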

Maintenance & Community

The project is associated with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Further details on community channels or roadmaps are not provided in the README.

Licensing & Compatibility

The project is primarily distributed under the Apache 2.0 license. However, the README notes that the model is provided for non-commercial purposes only, governed by the LLAMA 3.2 Community License Agreement and OpenAI's Terms of Use. This layered licensing may restrict commercial use or integration into closed-source products.

Limitations & Caveats

The model's performance is benchmarked against specific versions of other models, and direct comparisons may vary with future updates. The README mentions that more details about finetuning will be available soon, suggesting ongoing development. The licensing terms require careful review for commercial applications.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 16 stars in the last 90 days

