Magma by microsoft

Multimodal AI agent foundation model research paper

Created 1 year ago

1,889 stars

Top 22.8% on SourcePulse

2 Experts Love This Project

Travis Fischer

Founder of Agentic

Jianwei Yang

Research Scientist at Meta Superintelligence Lab

Project Summary

Magma is a foundation model for multimodal AI agents, designed to process and act upon visual and textual information across digital and physical environments. It targets researchers and developers building AI agents capable of complex, goal-driven interactions, offering versatile capabilities from UI navigation to robotics manipulation.

How It Works

Magma uses a unified pretraining framework that integrates text, image, and action modalities. It trains on large-scale, heterogeneous data, including unlabeled in-the-wild videos and existing agentic datasets. Two novel pretraining objectives, "Set-of-Mark" and "Trace-of-Mark," bridge the gap between modalities: they foster cross-modal alignment and enable long-horizon action prediction and planning.

Quick Start & Requirements

Install: Clone the repository, then run pip install -e . to install the dependencies. Additional packages for training and agents are available via pip install -e ".[train]" and pip install -e ".[agent]".
Prerequisites: Python 3.10+, PyTorch, and Transformers >=4.49.0. A customized Transformers build that fixes a ConvNext backbone bug is required (pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2). Co-tracker and kmeans_pytorch must also be installed from source. A CUDA-enabled GPU is recommended for inference.
Resources: Inference with Magma-8B in bfloat16 requires ~17GB peak memory, while 4-bit quantization reduces this to ~7GB.
Links: Project Page, arXiv Paper, Hugging Face Model.
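
The memory figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes "8B" means roughly 8e9 parameters and counts gigabytes as 1e9 bytes, neither of which the summary states. It estimates the weight storage alone and compares it with the reported peak figures. The remainder is presumably activations, caches, and any components that are not quantized, but the summary does not break this down.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "Magma-8B" is treated as exactly 8e9 parameters.
NUM_PARAMS = 8e9

# Peak memory figures quoted in the Resources line, in GB.
REPORTED_PEAK_GB = {"bfloat16": 17.0, "4-bit": 7.0}
BITS_PER_PARAM = {"bfloat16": 16, "4-bit": 4}

estimates = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in BITS_PER_PARAM.items()
}

for name, weights_gb in estimates.items():
    # Whatever is left over is memory not explained by the weights.
    remainder_gb = REPORTED_PEAK_GB[name] - weights_gb
    print(
        f"{name}: ~{weights_gb:.1f} GB weights, "
        f"~{remainder_gb:.1f} GB of the reported peak unaccounted for"
    )
```

Under these assumptions the weights account for most of the bfloat16 peak, while the 4-bit peak includes a proportionally larger non-weight share.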

Highlighted Details

State-of-the-art performance on UI navigation, robotics manipulation, and general image/video understanding.
Achieves strong spatial understanding and reasoning capabilities.
Scalable pretraining strategy using unlabeled videos and agentic data.
Offers demos for UI agents, gaming agents, and robot visual planning.

Maintenance & Community

The project is led by Microsoft Research. The README indicates ongoing development with a list of planned releases, including more pretraining data and finetuning scripts. Community interaction channels are not explicitly mentioned.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and closed-source linking. A Contributor License Agreement (CLA) is required for contributions.

Limitations & Caveats

The model requires a specific, customized version of the Transformers library because of a bug in the ConvNext backbone. Some agent demos may require specific older library versions or have known issues (e.g., the robot visual planning demo). The model is intended for research purposes and requires careful evaluation for accuracy, safety, and fairness in downstream applications.
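
Because the version requirements are easy to get wrong, a quick environment check before loading the model may help. The following is a minimal sketch, not part of the Magma repository: the helper names parse_version and check_environment are hypothetical, and the thresholds are taken from the prerequisites listed on this page (Python 3.10+, Transformers >=4.49.0). The summary does not say what version string the customized Transformers branch reports, so a warning from this check is a prompt to look closer, not proof of a broken install.

```python
import sys
from importlib import metadata

# Thresholds taken from the prerequisites listed on this page.
MIN_PYTHON = (3, 10)
MIN_TRANSFORMERS = (4, 49, 0)


def parse_version(text: str) -> tuple:
    """Turn '4.49.0.dev0' into (4, 49, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment() -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(
            f"Python {sys.version_info[0]}.{sys.version_info[1]} is older than 3.10"
        )
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if parse_version(installed) < MIN_TRANSFORMERS:
            problems.append(f"transformers {installed} is older than 4.49.0")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The check only compares version numbers. It cannot tell whether the installed Transformers build actually contains the ConvNext fix.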

Health Check

Last Commit

3 months ago

Responsiveness

1 day

Star History

17 stars in the last 30 days