Thinking-with-Visual-Primitives by ailuntx

Multimodal reasoning model tackles spatial ambiguity

Created 1 month ago

251 stars

Top 99.8% on SourcePulse

Project Summary

This project addresses the "Reference Gap" in multimodal large language models (MLLMs) with a "Point-to-Reason Synergy" paradigm that interleaves visual primitives (points and bounding boxes) into the model's thought process. Anchoring references to coordinates rather than to ambiguous language enables precise spatial reasoning. The project targets researchers and developers working on complex structural and topological reasoning tasks, and it reports competitive performance with significantly reduced visual token usage.

How It Works

The core innovation, "Point-to-Reason Synergy," treats visual primitives as minimal units of thought that anchor abstract language directly to concrete spatial coordinates, mirroring human cognitive processes in tasks that require precise spatial referencing. The architecture, built on DeepSeek-V4-Flash, achieves "Extreme Visual Token Efficiency" by compressing the KV cache for visual tokens, which drastically reduces computational load while maintaining reasoning depth.
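As an illustration of what interleaving primitives into a reasoning trace could look like, here is a minimal Python sketch. The `<point>` and `<box>` tags, the comma-separated coordinate layout, and the names `Primitive` and `extract_primitives` are all hypothetical; this summary does not describe how the model actually serializes its primitives.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# HYPOTHETICAL markup: the tag names and coordinate format below are
# assumptions made for illustration, not the model's documented output.
PRIMITIVE_RE = re.compile(r"<(point|box)>\s*([\d\s,.]+?)\s*</\1>")


@dataclass
class Primitive:
    kind: str                   # "point" or "box"
    coords: Tuple[float, ...]   # (x, y) or (x1, y1, x2, y2)
    offset: int                 # character position of the tag in the trace


def extract_primitives(trace: str) -> List[Primitive]:
    """Collect every point/box embedded in a reasoning trace, in order."""
    found = []
    for match in PRIMITIVE_RE.finditer(trace):
        kind = match.group(1)
        coords = tuple(float(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} values, got {len(coords)}")
        found.append(Primitive(kind, coords, match.start()))
    return found


trace = (
    "The cup <point>120, 85</point> sits left of the plate "
    "<box>200, 60, 340, 180</box>."
)
for primitive in extract_primitives(trace):
    print(primitive.kind, primitive.coords)
```

Keeping each primitive's offset alongside its coordinates preserves the link between a spatial reference and the sentence that uses it, which is the point of interleaving rather than listing coordinates separately.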

Quick Start & Requirements

This repository is an archived snapshot of an original source that is no longer available, and it provides no installation or quick-start commands. For updates, watch the charlesCXK GitHub profile (https://github.com/charlesCXK) or the DeepSeek organization (https://github.com/deepseek-ai); as of May 22, 2026, both sources were reported as unavailable.

Highlighted Details

Achieves "Frontier-Competitive Performance," matching models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash on specific counting and spatial reasoning benchmarks despite a compact model scale and a lower image-token budget.
"Extreme Visual Token Efficiency" compresses the KV cache so that every 4 visual tokens share a single entry, significantly reducing memory and computation.
"Point-to-Reason Synergy" addresses the "Reference Gap" in MLLMs, enabling precise spatial reasoning through visual primitives.
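The 4-to-1 cache ratio above can be made concrete with a short Python sketch. Only the group size of 4 comes from this summary. The mean-pooling operator, the handling of a trailing partial group, and the function names are assumptions for illustration, not the project's actual compression method.

```python
from math import ceil
from typing import List, Sequence

GROUP_SIZE = 4  # from the summary: every 4 visual tokens map to 1 cache entry


def compressed_entry_count(num_visual_tokens: int, group_size: int = GROUP_SIZE) -> int:
    """Number of cache entries left after grouping visual tokens."""
    if num_visual_tokens < 0:
        raise ValueError("num_visual_tokens must be non-negative")
    return ceil(num_visual_tokens / group_size)


def pool_groups(vectors: Sequence[Sequence[float]],
                group_size: int = GROUP_SIZE) -> List[List[float]]:
    """Average each consecutive group of vectors into a single vector.

    ASSUMPTION: mean pooling stands in for whatever operator the model
    really uses; a trailing partial group is averaged over its own members.
    """
    pooled = []
    for start in range(0, len(vectors), group_size):
        group = vectors[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


# A hypothetical image encoded as 1024 visual tokens needs 256 cache entries.
print(compressed_entry_count(1024))
```

Under this grouping the visual portion of the cache shrinks by roughly a factor of four, which is where the memory and computation savings come from.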

Maintenance & Community

The project is an archived snapshot, and the original source repository is unavailable. As of May 22, 2026, no official replacement repository or re-release has been found. For future updates, watch the charlesCXK profile and the DeepSeek organization. Contact is available by email at service@deepseek.com or by raising an issue.

Licensing & Compatibility

The code repository is licensed under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

This repository is an archived snapshot and not an authoritative source. The original upstream repository and associated DeepSeek organization repository are currently unavailable, with no known replacement. The reported benchmark scores cover only a subset of evaluation dimensions relevant to the paper's focus and are not indicative of overall model capabilities.

Health Check

Last Commit: 3 weeks ago

Responsiveness: Inactive

Star History

213 stars in the last 30 days