allenai
Advanced vision-language model for video understanding and grounding
Top 62.8% on SourcePulse
Molmo2 is an open-source vision-language model for video understanding, pointing, and tracking. It targets tasks that require fine-grained visual grounding and comprehension across single images, multiple images, and videos. It is aimed at researchers and engineers working on multimodal AI, particularly point-driven grounding tasks.
How It Works
Molmo2 is trained in three stages: pre-training on image captioning, NLP, and pointing tasks; multitask supervised fine-tuning (SFT); and long-context SFT for video comprehension. The implementation is built on PyTorch and uses models such as Qwen for language processing and SigLIP for vision. To handle long video sequences (up to 384 frames), it uses Context Parallelism (CP), which shards computation across multiple GPUs so that sequences exceeding single-GPU memory can still be processed.
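To make the sharding idea concrete, here is a minimal sketch of the partitioning step only, assuming frames are split into contiguous, near-equal chunks with one chunk per GPU rank. The function name shard_frames and the choice of 8 ranks are hypothetical and are not taken from the Molmo2 codebase. Real context parallelism shards the token sequence and also needs cross-GPU communication during attention, which this sketch leaves out.

```python
def shard_frames(num_frames: int, world_size: int) -> list[range]:
    """Split frame indices into contiguous, near-equal shards, one per rank.

    Illustrative only: this is not Molmo2's implementation, and it covers
    just the partitioning step, not attention across shards.
    """
    if num_frames < 0 or world_size <= 0:
        raise ValueError("num_frames must be >= 0 and world_size must be > 0")

    base, extra = divmod(num_frames, world_size)
    shards = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional frame each.
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# A 384-frame video split across 8 hypothetical GPU ranks.
shards = shard_frames(384, 8)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: frames {shard.start}..{shard.stop - 1} ({len(shard)} frames)")
```

With 384 frames and 8 ranks, every rank holds 48 frames, so per-GPU activation memory for the video tokens drops roughly in proportion to the number of ranks.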
Quick Start & Requirements
Installation requires Python >= 3.11 and PyTorch. Clone the repository, then install torchcodec with pip install torchcodec (a separate install is recommended because of its complex dependencies), followed by pip install -e .[all]. A Docker image is available via docker pull ghcr.io/allenai/molmo2:latest. Data locations are set through the MOLMO_DATA_DIR and HF_HOME environment variables. A script is provided for downloading many of the datasets, though some must be obtained manually due to licensing. Fast inference is supported via vLLM (>= 0.15.0).
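The requirements above can be checked before installing with a small pre-flight script. This is a sketch under stated assumptions: the preflight helper is hypothetical and not part of the repository, and it only checks the Python version, the two environment variables, and whether torch and torchcodec are importable. It does not verify the vLLM version or download any data.

```python
import importlib.util
import os
import sys


def preflight(env=None, version=None, modules=("torch", "torchcodec")):
    """Return a list of findings; an empty list means nothing was flagged.

    Hypothetical helper, not part of Molmo2. `env` and `version` can be
    passed explicitly, which makes the check easy to exercise in isolation.
    """
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)

    findings = []
    if version < (3, 11):
        findings.append(f"Python >= 3.11 required, found {version[0]}.{version[1]}")

    # Environment variables the summary names for data management.
    for var in ("MOLMO_DATA_DIR", "HF_HOME"):
        if not env.get(var):
            findings.append(f"{var} is not set")

    # find_spec looks a package up without importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            findings.append(f"{name} is not installed")
    return findings


if __name__ == "__main__":
    for line in preflight() or ["environment looks ready"]:
        print(line)
```

An unset HF_HOME is reported here only as a reminder; Hugging Face libraries fall back to a default cache directory when it is missing.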
Highlighted Details
Maintenance & Community
The repository is published by Allen AI. The README does not describe maintenance status, community channels (e.g., Discord, Slack), or notable contributors.
Licensing & Compatibility
The repository's README does not explicitly state the software license. This omission leaves the terms of use and compatibility for commercial or closed-source applications unclear.
Limitations & Caveats
Several datasets must be downloaded manually due to licensing restrictions. The torchcodec package has complex dependencies that can complicate installation, which is why a separate install is recommended. The README also states no software license, so terms of use remain unclear.