UMass-Embodied-AGI: Machine mental imagery for multimodal reasoning
Top 96.6% on SourcePulse
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
UMass-Embodied-AGI/Mirage introduces "Machine Mental Imagery," a novel approach to enhance multimodal reasoning. It tackles diverse reasoning tasks by interleaving compact latent visual tokens with explicit text tokens, aiming to boost performance without the computational overhead of full pixel-level image generation. This project targets researchers and engineers seeking more efficient multimodal AI solutions.
How It Works
Mirage represents visual information through latent visual tokens: compact representations of image features. These tokens are interleaved with text tokens, allowing holistic multimodal input processing and reasoning. This design grounds visual concepts in a latent space, bypassing computationally intensive pixel-level image synthesis during inference.
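The interleaving idea above can be illustrated with a toy sketch. This is not the Mirage implementation: the embedding function, the token dimension, and the latent-token generator are all illustrative stand-ins, and it only shows how compact latent visual tokens can occupy positions in the same input sequence as text-token embeddings.

```python
# Toy sketch (illustrative, not Mirage's actual code): build an input
# sequence that interleaves compact latent visual tokens with text tokens.

LATENT_DIM = 4          # toy embedding size; real models use far larger dims
NUM_LATENT_TOKENS = 2   # a handful of latent tokens instead of full images

def embed_text(tokens):
    # Stand-in for a text embedding lookup: one vector per token.
    return [[float(hash(t) % 7)] * LATENT_DIM for t in tokens]

def make_latent_visual_tokens(n, dim):
    # Stand-in for the compact latent visual tokens the model "imagines";
    # in Mirage these live in latent space and are never decoded to pixels.
    return [[0.5] * dim for _ in range(n)]

def build_interleaved_sequence(text_before, text_after):
    # Text tokens, then latent visual tokens, then more text tokens,
    # all processed as one sequence by the multimodal model.
    seq = embed_text(text_before)
    seq += make_latent_visual_tokens(NUM_LATENT_TOKENS, LATENT_DIM)
    seq += embed_text(text_after)
    return seq

seq = build_interleaved_sequence(["plan", ":"], ["move", "left"])
print(len(seq))  # 2 text + 2 latent + 2 text = 6 positions
```

Because every position, visual or textual, is just a vector of the same dimension, the model can reason over both kinds of token with a single forward pass.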
Quick Start & Requirements
Installation requires a Conda environment with Python 3.10: clone the repo, then install dependencies via pip install -r requirements.txt and pip install -e ./transformers/. Data preparation involves formatting JSON files and extracting the provided tarballs for tasks such as VSP spatial planning. Training is a two-stage process run through python src/main.py with Qwen2.5-VL-7B-Instruct as the base model; inference uses python src/test.py. The project is associated with a CVPR 2026 publication (arXiv:2506.17218).
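The steps above can be sketched as a shell session. The repo URL is inferred from the project name, the environment name "mirage" is illustrative, and the exact flags for training and inference depend on the task configs in the repository:

```shell
# Create and activate the Python 3.10 environment (name "mirage" is illustrative)
conda create -n mirage python=3.10 -y
conda activate mirage

# Clone the repository and install dependencies
git clone https://github.com/UMass-Embodied-AGI/Mirage.git
cd Mirage
pip install -r requirements.txt
pip install -e ./transformers/   # bundled transformers fork

# Two-stage training with Qwen2.5-VL-7B-Instruct as the base model,
# then inference; consult the repo's README for task-specific arguments.
python src/main.py
python src/test.py
```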
Maintenance & Community
The README does not detail community channels (e.g., Discord, Slack), roadmap updates, or sponsorships. Maintenance and development are indicated by the release of code, data, and model weights for specific tasks.
Licensing & Compatibility
The repository's README does not specify a software license. This omission leaves its terms of use, distribution, and suitability for commercial or closed-source applications unclear.
Limitations & Caveats
Model checkpoints are currently available only for the VSP spatial planning task without Chain-of-Thought (CoT). The team plans to release updated model weights and expand the dataset to enhance performance further.
Last updated 8 months ago; repository marked inactive.