molmo2 by allenai

Advanced vision-language model for video understanding and grounding

Created 6 months ago

684 stars

Top 48.8% on SourcePulse

Project Summary

Molmo2 is an open-source vision-language model for video understanding, pointing, and tracking. It targets tasks that need fine-grained visual grounding and comprehension across single images, multiple images, and videos. It is aimed at researchers and engineers working on multimodal AI, with a particular focus on point-driven grounding tasks.

How It Works

Molmo2 is trained in three stages: pre-training on image captioning, NLP, and pointing tasks; multitask supervised fine-tuning (SFT); and long-context SFT for video comprehension. The implementation is built on PyTorch and uses models such as Qwen for language processing and SigLIP for vision. To handle long video sequences (up to 384 frames), it uses Context Parallelism (CP), which shards the computation across multiple GPUs so that a sequence too large for a single GPU's memory can still be processed.
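The sharding idea behind Context Parallelism can be illustrated with a small sketch. This is not Molmo2's implementation: the `Shard` class, the `shard_frames` function, and the choice of 8 GPUs are hypothetical names and values picked for illustration. It only shows how a 384-frame sequence could be cut into contiguous, near-equal chunks, one per GPU rank. Real context-parallel training typically also exchanges attention data between ranks, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    rank: int   # hypothetical GPU index
    start: int  # first frame index held by this rank (inclusive)
    stop: int   # one past the last frame index held by this rank


def shard_frames(num_frames: int, world_size: int) -> list[Shard]:
    """Split frame indices into contiguous, near-equal chunks, one per rank."""
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    base, extra = divmod(num_frames, world_size)
    shards = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks each take one additional frame.
        size = base + (1 if rank < extra else 0)
        shards.append(Shard(rank=rank, start=start, stop=start + size))
        start += size
    return shards


if __name__ == "__main__":
    # 384 frames is the maximum quoted above; 8 GPUs is an assumed value.
    for shard in shard_frames(num_frames=384, world_size=8):
        print(shard)
```

With these assumed numbers each rank holds 48 frames, so per-GPU activation memory scales with the chunk rather than with the full sequence.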

Quick Start & Requirements

Installation requires Python >= 3.11 and PyTorch. Clone the repository and, inside the clone, first run "pip install torchcodec" (installing it separately is recommended because of its complex dependencies), then run "pip install -e .[all]". A Docker image is available via "docker pull ghcr.io/allenai/molmo2:latest". Data locations are controlled by the MOLMO_DATA_DIR and HF_HOME environment variables. A script is provided for downloading many of the datasets, though some must be acquired manually because of licensing. Fast inference is supported via vLLM (>= 0.15.0).
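The requirements above can be checked before installing with a short preflight script. This is a sketch under stated assumptions, not part of the repository: `parse_version` and `check_setup` are hypothetical helpers, and treating an unset MOLMO_DATA_DIR or HF_HOME as a problem is this sketch's own choice. Only the version bounds and variable names come from the summary.

```python
import os
import sys


def parse_version(text: str) -> tuple:
    """Turn '0.15.0' into (0, 15, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '0.15' compares equal to '0.15.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_setup(python_version, env, vllm_version=None) -> list:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    if tuple(python_version[:2]) < (3, 11):
        problems.append("Python >= 3.11 is required")
    for name in ("MOLMO_DATA_DIR", "HF_HOME"):
        # Assumption of this sketch: both variables should be set explicitly.
        if not env.get(name):
            problems.append(f"{name} is not set")
    if vllm_version is not None and parse_version(vllm_version) < (0, 15, 0):
        problems.append("vLLM >= 0.15.0 is required for fast inference")
    return problems


if __name__ == "__main__":
    for problem in check_setup(sys.version_info, os.environ):
        print("warning:", problem)
```

Pass the installed vLLM version string as the third argument to include that check; it is skipped when the argument is omitted.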

Highlighted Details

Reports state-of-the-art performance among open-source models on vision-language tasks.
Features novel point-driven grounding capabilities across single-image, multi-image, and video modalities.
Supports extended context lengths (36k+ tokens, 384 frames) through Context Parallelism for deep video understanding.
Offers fast inference via integration with vLLM.
Provides checkpoints at various training stages (Pre-Training, SFT, Long-Context SFT).

Maintenance & Community

The repository is published by the Allen Institute for AI (allenai). The README does not describe maintenance status, community channels (e.g., Discord, Slack), or notable contributors.

Licensing & Compatibility

The repository's README does not explicitly state the software license. This omission leaves the terms of use and compatibility for commercial or closed-source applications unclear.

Limitations & Caveats

Several datasets require manual download due to licensing restrictions. The torchcodec dependency has known complexities that may complicate installation. The software license is not stated in the README.
