molmoact by allenai

Multimodal model for spatial action reasoning

Created 4 months ago
262 stars

Top 97.4% on SourcePulse

View on GitHub: https://github.com/allenai/molmoact
Project Summary

MolmoAct provides an open-source framework for training and deploying Ai2's multimodal language model for action reasoning in spatial environments. It targets robots and agents that must understand and act on combined visual and textual instructions, and is aimed at researchers and developers in robotics and AI. The project releases the code, datasets, and pre-trained models needed to replicate the published pipeline or fine-tune the model for custom, real-world applications.

How It Works

MolmoAct is a multimodal language model that couples visual perception with action reasoning. It takes images and textual instructions as input and generates action sequences, using Depth-Anything-V2 for depth estimation and a VQVAE to tokenize the resulting depth maps. The training pipeline has distinct pre-training, mid-training, and post-training (fine-tuning) stages, and supports both full-parameter and LoRA fine-tuning. Inference runs through Hugging Face Transformers or vLLM for efficient deployment.
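
As an illustration of Transformers-based inference, here is a minimal sketch that assumes MolmoAct follows the same remote-code API as the Molmo family (an AutoProcessor with a process method and generate_from_batch). The repository ID allenai/MolmoAct-7B-D-0812, the image path, and the prompt are placeholders; check the model cards on Hugging Face for the exact usage.

```python
# Minimal inference sketch, assuming MolmoAct exposes a Molmo-style
# remote-code API on Hugging Face; repo ID, image, and prompt are placeholders.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/MolmoAct-7B-D-0812"  # assumed checkpoint name, verify on the Hub

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Single image plus an instruction; the processor builds the model inputs.
inputs = processor.process(
    images=[Image.open("scene.png")],
    text="Describe the action needed to pick up the red mug.",
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate the action-reasoning output as text tokens.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```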

Quick Start & Requirements

Installation is recommended via the provided Dockerfile. Alternatively, clone the repository (git clone https://github.com/allenai/molmoact.git), change into the directory, and run pip install -e .[all]. Prerequisites include Python 3.11, PyTorch, and git. Training and fine-tuning demand significant resources: multiple high-end GPUs (e.g., 8x A100/H100), substantial storage for the datasets (MolmoAct Dataset, Pre-training Mixture, Mid-training Mixture), and wget for downloading model checkpoints. Official datasets and models are hosted on Hugging Face. Evaluation code is provided for SimplerEnv and LIBERO.
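
Since the checkpoints and dataset mixtures live on the Hugging Face Hub, something like the following can pre-fetch them before training; the repository IDs and local paths below are placeholders, so confirm the exact names on Ai2's Hugging Face page.

```python
# Sketch for pulling a MolmoAct checkpoint and a dataset mixture from the
# Hugging Face Hub; repo IDs and local paths are assumptions, not fixed names.
from huggingface_hub import snapshot_download

# Model weights (tens of GB; make sure local_dir has enough free space).
snapshot_download(
    repo_id="allenai/MolmoAct-7B-D-0812",  # assumed checkpoint name
    local_dir="checkpoints/molmoact-7b",
)

# A training mixture released as a dataset repository.
snapshot_download(
    repo_id="allenai/MolmoAct-Dataset",  # assumed dataset name
    repo_type="dataset",
    local_dir="data/molmoact-dataset",
)
```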

Highlighted Details

  • Fully open-source release of code, datasets, and models.
  • Scripts to replicate the full training pipeline: pre-training, mid-training, and post-training (LIBERO fine-tuning).
  • Supports fine-tuning with both full parameters and LoRA adapters.
  • Inference via Hugging Face Transformers and vLLM (see the vLLM sketch after this list).
  • Integrated depth estimation using Depth-Anything-V2, with depth maps tokenized by a VQVAE.
  • Evaluation frameworks for SimplerEnv and LIBERO benchmarks.
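
For higher-throughput serving via vLLM, an offline-inference call along these lines should apply; the model ID, dtype, image, and prompt formatting are assumptions, and the expected prompt template should be taken from the MolmoAct model card.

```python
# Illustrative vLLM offline-inference sketch; model ID and prompt format are
# assumptions -- consult the MolmoAct model card for the expected template.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="allenai/MolmoAct-7B-D-0812",  # assumed checkpoint name
    trust_remote_code=True,
    dtype="bfloat16",
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# vLLM accepts a dict prompt carrying text plus multimodal data.
outputs = llm.generate(
    {
        "prompt": "Describe the action needed to pick up the red mug.",
        "multi_modal_data": {"image": Image.open("scene.png")},
    },
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```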

Maintenance & Community

The project is maintained by Allen Institute for AI (Ai2). For inquiries, collaborations, or support, contact haoquanf, jasonl, or jiafeid at allenai.org. Bug reports and feature requests should be submitted as GitHub issues. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

MolmoAct is released under the Apache 2.0 license, primarily intended for research and educational use. While Apache 2.0 is generally permissive for commercial applications, users should consult the project's "Responsible Use Guidelines" for any specific restrictions or considerations regarding commercial deployment or closed-source integration.

Limitations & Caveats

Real-world evaluation content is marked as "coming soon." Training requires substantial GPU resources and storage, and data processing and downloads can be time-consuming, with some datasets requiring manual acquisition. The README references "Responsible Use Guidelines" without providing a direct link, so users may need to locate them separately.

Health Check
  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 6
  • Star history: 20 stars in the last 30 days
