Multimodal open language model code, training, and evaluation
Molmo provides the codebase for training and deploying state-of-the-art multimodal open language models, i.e. open vision-language models (VLMs). It targets researchers and developers working on vision-language tasks, offering a foundation for building and evaluating models that understand and generate content from both images and text. The project aims to make advanced VLM capabilities broadly accessible.
How It Works
Molmo builds upon the OLMo codebase, integrating vision encoding capabilities and generative evaluation frameworks. It supports various vision encoders (CLIP, SigLIP, MetaCLIP, DINOv2) and LLMs (OLMo, Qwen2), allowing for flexible model configurations. The architecture is designed for both pre-training and fine-tuning, with a focus on enabling complex multimodal reasoning and generation tasks.
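For orientation, the sketch below shows in rough form how such a configuration pairs a vision backbone with an LLM. The class and field names are illustrative placeholders, not the repository's actual config API.

from dataclasses import dataclass

# Illustrative placeholder classes, not Molmo's real config objects.
@dataclass
class VisionBackboneConfig:
    name: str = "siglip"          # could also be "clip", "metaclip", or "dinov2"
    image_size: int = 336
    patch_size: int = 14

@dataclass
class LLMConfig:
    name: str = "qwen2-7b"        # could also be an OLMo variant
    max_sequence_length: int = 4096

@dataclass
class VLMConfig:
    vision_backbone: VisionBackboneConfig
    llm: LLMConfig
    # A connector maps vision-patch features into the LLM's embedding space.
    connector_dim: int = 2048

config = VLMConfig(vision_backbone=VisionBackboneConfig(), llm=LLMConfig(name="olmo-7b"))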
Quick Start & Requirements
git clone https://github.com/allenai/molmo.git && cd molmo && pip install -e .[all]
pip install git+https://github.com/Muennighoff/megablocks.git@olmoe
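After installation, the released checkpoints can also be run directly through Hugging Face transformers. The following is a minimal sketch based on the pattern published on the Molmo model cards; the checkpoint name, the example image URL, and the checkpoint-specific calls processor.process and model.generate_from_batch come from the model's remote code, so verify them against the current model card.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Checkpoint name assumed from the released Molmo collection; swap in the model you want.
ckpt = "allenai/Molmo-7B-D-0924"

# trust_remote_code=True is required because the model and processor ship custom code.
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True, torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True, torch_dtype="auto", device_map="auto")

# Build a single image + text prompt (any local PIL image works as well).
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")

# Move tensors to the model's device and add a batch dimension of 1.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate up to 200 new tokens, stopping at the end-of-text marker.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))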
The environment variables MOLMO_DATA_DIR and HF_HOME can be set for custom data and cache locations, as in the sketch below.
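For example, assuming data should live under /data (the paths are placeholders), the variables can be exported in the shell or set in Python before the training code is imported:

import os

# Placeholder paths; set these before importing the training code,
# or export the same variables in your shell instead.
os.environ["MOLMO_DATA_DIR"] = "/data/molmo"   # where Molmo datasets are stored
os.environ["HF_HOME"] = "/data/hf_cache"       # Hugging Face cache (models, tokenizers, datasets)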
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Some datasets (InfoQa, Scene-Text, PixMo-Clocks) require manual downloads or more complex setup. Minor differences may exist between released models and those trained with the current codebase due to potential download issues or dataset exclusions.