molmo by allenai

Multimodal open language model code, training, and evaluation

created 8 months ago
582 stars

Top 56.4% on sourcepulse

View on GitHub
Project Summary

Molmo provides the codebase for training and deploying state-of-the-art open vision-language models (VLMs). It targets researchers and developers working on vision-language tasks, offering a foundation for building and evaluating models that understand and generate content from both images and text. The project aims to democratize access to advanced VLM capabilities.

How It Works

Molmo builds upon the OLMo codebase, integrating vision encoding capabilities and generative evaluation frameworks. It supports various vision encoders (CLIP, SigLIP, MetaCLIP, DINOv2) and LLMs (OLMo, Qwen2), allowing for flexible model configurations. The architecture is designed for both pre-training and fine-tuning, with a focus on enabling complex multimodal reasoning and generation tasks.
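
As a concrete illustration of this mix-and-match design, a configuration might pair any supported vision encoder with any supported LLM. The sketch below is hypothetical: the class, field names, and checkpoint strings are illustrative placeholders, not Molmo's actual configuration API.

    from dataclasses import dataclass

    # Hypothetical sketch of a Molmo-style encoder/LLM pairing; the class and
    # field names are illustrative, not the codebase's real config classes.
    @dataclass
    class VLMConfig:
        vision_backbone: str  # a CLIP, SigLIP, MetaCLIP, or DINOv2 variant
        llm: str              # an OLMo or Qwen2 variant

    # Pairings in the spirit of the released models (placeholder identifiers):
    molmo_7b_d_like = VLMConfig(vision_backbone="clip-vit-l-14-336", llm="qwen2-7b")
    molmo_7b_o_like = VLMConfig(vision_backbone="clip-vit-l-14-336", llm="olmo-7b")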

Quick Start & Requirements

  • Installation: git clone https://github.com/allenai/molmo.git && cd molmo && pip install -e .[all]
  • Prerequisites: Python 3.10+, PyTorch (OS-specific installation). For MolmoE-1B training: pip install git+https://github.com/Muennighoff/megablocks.git@olmoe.
  • Data Setup: Requires downloading datasets via provided scripts, which can take up to a day. The environment variables MOLMO_DATA_DIR and HF_HOME can be set to use custom data locations (see the sketch after this list).
  • Resources: Training and evaluation of larger models (e.g., 72B) require multi-node setups and significant GPU resources. High-resolution evaluation may require FSDP to avoid Out-of-Memory errors.
  • Links: Video Demo, Public Demo, Technical Report, PixMo Datasets.
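
As a minimal sketch of the data setup, the two environment variables above could be set before invoking the download scripts; the paths are placeholders:

    import os

    # Placeholder paths: point dataset downloads and the Hugging Face cache
    # at custom locations before running the provided download scripts.
    os.environ["MOLMO_DATA_DIR"] = "/mnt/storage/molmo_data"
    os.environ["HF_HOME"] = "/mnt/storage/hf_home"

In practice these would typically be exported in the shell before launching the scripts; setting them in-process only affects that Python session.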

Highlighted Details

  • Offers a family of open VLMs: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B.
  • Released alongside the PixMo dataset collection, featuring diverse VLM training data like dense captions, instruction tuning triplets, and grounding annotations.
  • Supports evaluation on 11 downstream tasks, with options for low-resolution and high-resolution processing.
  • Codebase is compatible with Hugging Face models and includes scripts for converting weights.
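
As a sketch of the Hugging Face path, the converted checkpoints are published with remote code and can be loaded roughly as shown on the model cards. The checkpoint name and the processor.process / generate_from_batch methods follow that published pattern, but should be verified against the current model card:

    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    # Load a converted checkpoint; trust_remote_code pulls in Molmo's model code.
    repo = "allenai/Molmo-7B-D-0924"
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                              torch_dtype="auto", device_map="auto")
    model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                                 torch_dtype="auto", device_map="auto")

    # Process one image and a prompt, then generate (method names per the model card).
    image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
    inputs = processor.process(images=[image], text="Describe this image.")
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    new_tokens = output[0, inputs["input_ids"].size(1):]
    print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))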

Maintenance & Community

  • Developed by Allen Institute for AI (AI2).
  • Active development indicated by recent releases (Dec 2024).
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The codebase is released under the Apache 2.0 license.
  • Model weights are generally open, but specific terms for commercial use should be verified.

Limitations & Caveats

Some datasets (InfoQA, Scene-Text, PixMo-Clocks) require manual downloads or more involved setup. Models trained with the current codebase may differ slightly from the released ones due to dataset download issues or exclusions.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 4
  • Star history: 186 stars in the last 90 days
