Mirage by UMass-Embodied-AGI

Machine mental imagery for multimodal reasoning

Created 9 months ago
264 stars

Top 96.6% on SourcePulse

View on GitHub
Project Summary

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

UMass-Embodied-AGI/Mirage introduces "Machine Mental Imagery," a novel approach to enhance multimodal reasoning. It tackles diverse reasoning tasks by interleaving compact latent visual tokens with explicit text tokens, aiming to boost performance without the computational overhead of full pixel-level image generation. This project targets researchers and engineers seeking more efficient multimodal AI solutions.

How It Works

Mirage represents visual information through latent visual tokens, compact representations of imagery features. These tokens are interleaved with text tokens, allowing holistic multimodal input processing and reasoning. This design grounds visual concepts in a latent space, bypassing computationally intensive pixel-level image synthesis during inference for improved reasoning.
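The interleaving described above can be sketched in a few lines. This is an illustrative toy, not the repo's actual API: the embedding tables, the mean-pooling "encoder," and all names and shapes here are assumptions standing in for the learned components.

```python
import numpy as np

D = 8  # shared embedding width shared by text and visual tokens (assumption)

def embed_text(tokens):
    """Toy deterministic text embedding: one row per token id."""
    rng = np.random.default_rng(0)
    table = rng.standard_normal((100, D))
    return table[np.asarray(tokens) % 100]

def make_latent_visual_tokens(image_feats, n_latent=4):
    """Compress many image features into a few compact latent visual tokens.
    Here: mean-pool contiguous chunks; a stand-in for a learned encoder."""
    chunks = np.array_split(image_feats, n_latent, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def interleave(text_emb, visual_tokens, insert_at):
    """Splice latent visual tokens into the text sequence at insert_at,
    so the model processes one mixed sequence of text + latent tokens."""
    return np.concatenate(
        [text_emb[:insert_at], visual_tokens, text_emb[insert_at:]], axis=0
    )

text_emb = embed_text([5, 17, 42])                      # 3 text tokens
img_feats = np.random.default_rng(1).standard_normal((16, D))
vis = make_latent_visual_tokens(img_feats)              # 4 latent visual tokens
seq = interleave(text_emb, vis, insert_at=1)
print(seq.shape)                                        # (7, 8)
```

The key point the sketch captures is that the visual side enters as a handful of latent vectors in the same embedding space as text, rather than as generated pixels.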

Quick Start & Requirements

Installation requires a Conda environment with Python 3.10: clone the repo, then install dependencies via pip install -r requirements.txt and pip install -e ./transformers/. Data preparation involves formatting data as JSON and extracting the provided tarballs for tasks such as VSP spatial planning. Training is a two-stage process run with python src/main.py, using Qwen2.5-VL-7B-Instruct as the base model; inference uses python src/test.py. The project is associated with a CVPR 2026 publication (arXiv:2506.17218).
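The steps above can be collected into one setup script. The commands come from the summary; the environment name, repo URL path, and the absence of stage-specific flags on src/main.py are assumptions, so check the repo's README before running.

```shell
# Environment (name "mirage" is an assumption)
conda create -n mirage python=3.10 -y
conda activate mirage

# Clone and install dependencies
git clone https://github.com/UMass-Embodied-AGI/Mirage.git
cd Mirage
pip install -r requirements.txt
pip install -e ./transformers/.

# Two-stage training with Qwen2.5-VL-7B-Instruct as the base model
# (stage selection arguments are not specified in this summary)
python src/main.py

# Inference
python src/test.py
```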

Highlighted Details

  • Novel "Machine Mental Imagery" paradigm using latent visual tokens for multimodal reasoning.
  • Achieves boosted reasoning performance without full pixel-level image generation.
  • Associated with a CVPR 2026 publication.
  • Supports tasks like VSP spatial planning.

Maintenance & Community

The README does not detail community channels (e.g., Discord, Slack), roadmap updates, or sponsorships. Ongoing maintenance is evidenced mainly by the release of code, data, and model weights for specific tasks.

Licensing & Compatibility

The repository's README does not specify a software license. This omission requires clarification regarding its terms of use, distribution, and compatibility for commercial or closed-source applications.

Limitations & Caveats

Model checkpoints are currently available only for the VSP spatial planning task without Chain-of-Thought (CoT). The team plans to release updated model weights and expand the dataset to enhance performance further.

Health Check
Last Commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
15 stars in the last 30 days
