Magma by Microsoft

Multimodal AI agent foundation model research paper

created 8 months ago
1,765 stars

Top 24.9% on sourcepulse

Project Summary

Magma is a foundation model for multimodal AI agents, designed to process and act upon visual and textual information across digital and physical environments. It targets researchers and developers building AI agents capable of complex, goal-driven interactions, offering versatile capabilities from UI navigation to robotics manipulation.

How It Works

Magma employs a unified pretraining framework that integrates text, image, and action modalities. It leverages large-scale, heterogeneous training data, including unlabeled videos from the wild and existing agentic datasets. Novel pretraining objectives, "Set-of-Mark" and "Trace-of-Mark," are introduced to bridge the gap between modalities, fostering cross-modal alignment and enabling long-horizon action prediction and planning.
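The two objectives can be pictured with a toy sketch. This is purely illustrative, not Magma's actual implementation: the dict-based inputs and the function names `set_of_mark` and `trace_of_mark` are assumptions made here for clarity.

```python
# Illustrative toy sketch (NOT Magma's real code) of the two pretraining
# objectives: marking actionable regions, then tracking a mark over time.

def set_of_mark(boxes):
    """Set-of-Mark: assign a numeric mark to each actionable region so a
    model can refer to mark IDs instead of raw pixel coordinates."""
    return {mark_id: box for mark_id, box in enumerate(boxes, start=1)}

def trace_of_mark(frames, mark_id):
    """Trace-of-Mark: collect one marked region's positions across frames,
    a supervision signal for long-horizon action prediction."""
    return [frame[mark_id] for frame in frames if mark_id in frame]

# Two candidate regions (x, y, w, h) receive marks 1 and 2.
marks = set_of_mark([(10, 20, 50, 30), (70, 80, 40, 25)])

# A marked region tracked over three frames yields its motion trace.
frames = [{1: (10, 20)}, {1: (12, 22)}, {1: (15, 25)}]
trace = trace_of_mark(frames, mark_id=1)
```

The intuition: marks turn "click the button at (412, 87)" into "act on mark 2", and traces turn raw video into action-like sequences without manual labels.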

Quick Start & Requirements

  • Install: Clone the repository and run pip install -e . from its root. Optional extras for training and agents are available via pip install -e ".[train]" and pip install -e ".[agent]".
  • Prerequisites: Python 3.10+, PyTorch, and Transformers (>=4.49.0); a specific bug-fix fork for the ConvNext backbone is required: pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2. Co-tracker and kmeans_pytorch must also be installed from source. A CUDA-enabled GPU is recommended for inference.
  • Resources: Inference with Magma-8B in bfloat16 requires ~17GB peak memory, while 4-bit quantization reduces this to ~7GB.
  • Links: Project Page, arXiv Paper, Hugging Face Model.
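The memory figures above are close to what back-of-the-envelope arithmetic predicts. A minimal sketch, assuming ~8B parameters and ignoring activations, KV cache, and framework overhead (which account for the gap to the observed peaks):

```python
# Rough weight-memory estimate for an 8B-parameter model; the observed
# peaks (~17 GB bf16, ~7 GB 4-bit) include activation and runtime overhead.
params = 8e9

bf16_gb = params * 2 / 1024**3    # 2 bytes per weight in bfloat16
int4_gb = params * 0.5 / 1024**3  # 0.5 bytes per weight at 4-bit

print(f"bf16 weights: ~{bf16_gb:.1f} GB")   # roughly 15 GB
print(f"4-bit weights: ~{int4_gb:.1f} GB")  # roughly 3.7 GB
```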

Highlighted Details

  • State-of-the-art performance on UI navigation, robotics manipulation, and general image/video understanding.
  • Achieves strong spatial understanding and reasoning capabilities.
  • Scalable pretraining strategy using unlabeled videos and agentic data.
  • Offers demos for UI agents, gaming agents, and robot visual planning.

Maintenance & Community

The project is led by Microsoft Research. The README indicates ongoing development with a list of planned releases, including more pretraining data and finetuning scripts. Community interaction channels are not explicitly mentioned.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and closed-source linking. A Contributor License Agreement (CLA) is required for contributions.

Limitations & Caveats

The model requires a specific, customized version of the Transformers library due to a bug related to the ConvNext backbone. Some agent demos may require specific older versions of libraries or have known issues (e.g., robot visual planning demo). The model is intended for research purposes and requires careful evaluation for accuracy, safety, and fairness in downstream applications.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 153 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Toran Bruce Richards (founder of AutoGPT), and 2 more.

OS-Copilot by OS-Copilot
0.1% · 2k stars
OS agent for automating daily tasks
created 1 year ago · updated 10 months ago