allenai/molmoact: Multimodal model for spatial action reasoning
Top 97.4% on SourcePulse
MolmoAct is an open-source framework for training and deploying Ai2's multimodal language model for action reasoning in spatial environments. It targets robots and agents that must understand and act on visual and textual instructions, and it serves researchers and developers in robotics and AI. The project releases code, datasets, and pre-trained models, enabling replication and custom fine-tuning for real-world applications.
How It Works
MolmoAct is a multimodal language model that couples visual perception with action reasoning. Its architecture processes image inputs and generates action sequences, using Depth-Anything-V2 for depth estimation and a VQVAE for tokenization. The training pipeline has distinct pre-training, mid-training, and post-training (fine-tuning) stages, supporting both full-parameter and LoRA fine-tuning. Inference runs through Hugging Face Transformers and vLLM for efficient deployment.
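To make the tokenization idea concrete, here is a minimal sketch of how action-reasoning models commonly turn continuous action values into discrete tokens a language model can emit, via uniform binning. This is an illustration, not MolmoAct's actual code; the bin count, action range, and function names are assumptions.

```python
import numpy as np

# Assumed discretization parameters (illustrative, not from MolmoAct).
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize_action(action):
    """Map continuous action values in [LOW, HIGH] to discrete bin indices."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def detokenize_action(tokens):
    """Map bin indices back to continuous values in [LOW, HIGH]."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

# Round-trip example: error is bounded by half a bin width.
a = np.array([-0.5, 0.0, 0.73])
t = tokenize_action(a)
recovered = detokenize_action(t)
```

The same scheme applies whether the tokens come from uniform bins, as here, or from a learned codebook such as the VQVAE mentioned above; the model only ever sees discrete indices.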
Quick Start & Requirements
Installation is recommended via the provided Dockerfile. Alternatively, clone the repository (git clone https://github.com/allenai/molmoact.git), enter the directory, and run pip install -e .[all]. Prerequisites include Python 3.11, PyTorch, and git. Training and fine-tuning demand significant resources: multiple high-end GPUs (e.g., 8x A100/H100), substantial storage for the datasets (MolmoAct Dataset, Pre-training Mixture, Mid-training Mixture), and wget for downloading model checkpoints. Official datasets and models are available on Hugging Face. Evaluation code is provided for SimplerEnv and LIBERO.
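The manual install path above, collected as shell commands (the URL and the .[all] extras group are taken from the README; adjust for your environment):

```shell
# Assumes Python 3.11, PyTorch, and git are already installed.
# Recommended alternative: build the image from the provided Dockerfile.
git clone https://github.com/allenai/molmoact.git
cd molmoact
pip install -e .[all]
```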
Maintenance & Community
The project is maintained by Allen Institute for AI (Ai2). For inquiries, collaborations, or support, contact haoquanf, jasonl, or jiafeid at allenai.org. Bug reports and feature requests should be submitted as GitHub issues. Specific community channels like Discord or Slack are not detailed in the README.
Licensing & Compatibility
MolmoAct is released under the Apache 2.0 license, primarily intended for research and educational use. While Apache 2.0 is generally permissive for commercial applications, users should consult the project's "Responsible Use Guidelines" for any specific restrictions or considerations regarding commercial deployment or closed-source integration.
Limitations & Caveats
Real-world evaluation content is marked as "coming soon." Training requires substantial GPU resources and storage. Data processing and downloads can be time-consuming, with some datasets necessitating manual acquisition. The README references "Responsible Use Guidelines" without providing a direct link, which may require users to seek further clarification.