Wakals: Enabling VLMs to reason in continuous visual space
Top 99.6% on SourcePulse
Chain-of-Visual-Thought (CoVT) enhances Vision-Language Models (VLMs) by enabling reasoning in continuous visual space, addressing limitations in perceptual understanding. It introduces "continuous visual tokens" encoding rich perceptual cues, bridging language and vision for improved fine-grained understanding, spatial precision, and geometric awareness.
How It Works
CoVT equips VLMs with compact latent "continuous visual tokens" derived from lightweight vision experts (e.g., SAM, DepthAnything). During training, the VLM autoregressively predicts these tokens to reconstruct dense perceptual signals (segmentation, depth, edges). At inference, reasoning occurs directly in this efficient latent visual-token space, preserving resources and enabling more precise, grounded, and interpretable multimodal intelligence.
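The training-time idea above can be sketched as follows. This is a toy illustration, not the project's implementation: all names, dimensions, and the decoding scheme are hypothetical stand-ins. The model emits a small budget of continuous latent vectors, and a lightweight decoder expands them into a dense perceptual target (here, a flat stand-in for a depth map) for a reconstruction loss.

```python
import random

TOKEN_DIM = 8      # dimensionality of each continuous visual token (hypothetical)
TOKEN_BUDGET = 20  # CoVT reports a budget of roughly 20 tokens per visual thought
MAP_SIZE = 16      # side length of the toy dense target (e.g. a 16x16 depth map)

def predict_visual_tokens(seed):
    """Stand-in for the VLM autoregressively emitting continuous tokens.

    A real VLM conditions each token on the image, the text, and the
    previously emitted tokens; here we derive deterministic vectors from
    a seed just so the sketch is runnable.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(TOKEN_DIM)]
            for _ in range(TOKEN_BUDGET)]

def decode_to_dense(tokens):
    """Toy lightweight decoder: expand the token budget into a dense map.

    Each pixel is a crude summary of one latent token; the real decoder
    is a learned network producing segmentation, depth, or edge maps.
    """
    pixels = []
    for p in range(MAP_SIZE * MAP_SIZE):
        tok = tokens[p % TOKEN_BUDGET]
        pixels.append(sum(tok) / TOKEN_DIM)
    return pixels

def reconstruction_loss(pred, target):
    """Mean squared error against the dense perceptual signal."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

tokens = predict_visual_tokens(seed=42)
dense = decode_to_dense(tokens)
target = [0.0] * (MAP_SIZE * MAP_SIZE)  # stand-in ground-truth depth map
loss = reconstruction_loss(dense, target)
```

At inference the decoder is optional: reasoning happens on the `tokens` themselves, and the dense decoding step is only needed when an interpretable prediction is wanted.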
Quick Start & Requirements
Evaluation uses a forked VLMEvalKit. An interactive Gradio demo is available. Training data, code, and finetuned weights have been released. Setup involves integrating with base VLMs (e.g., Qwen2.5-VL, LLaVA) and likely requires GPU resources.
https://wakalsprojectpage.covt-website/
https://huggingface.co/collections/Wakals/covt-chain-of-visual-thought
https://arxiv.org/abs/2511.19418
Highlighted Details
CoVT consistently improves VLM performance by 3-16% across over ten perception benchmarks (e.g., CV-Bench, MMVP) when integrated with models like Qwen2.5-VL and LLaVA. Each visual token type (segment, depth, DINO, edge) shows task-specific effectiveness. The framework uses a small budget of ~20 tokens for visual thought and can optionally decode them into dense, interpretable predictions.
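To see why a ~20-token budget is compact, compare it with representing the same perceptual cue as a dense patch grid. The grid size below is an illustrative assumption, not a figure from the paper:

```python
# Illustrative comparison: a dense 24x24 patch grid (assumed size, not from
# the paper) vs. CoVT's reported budget of ~20 continuous visual tokens.
dense_tokens = 24 * 24          # 576 tokens for a dense patch representation
covt_budget = 20                # compact continuous-visual-token budget
savings = dense_tokens / covt_budget
print(f"{dense_tokens} dense tokens vs {covt_budget} -> {savings:.1f}x fewer")
```

Under these assumed numbers the latent budget is roughly 29x smaller, which is what lets reasoning stay in the token space at little extra cost.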
Maintenance & Community
Key releases (data, code, evaluation, weights) occurred on 2025-11-24. Future work includes supporting Hugging Face demos and more VLMs. Contact: ymk4474@gmail.com, xdwang@eecs.berkeley.edu; GitHub issues recommended for implementation problems. No explicit community channels listed.
Licensing & Compatibility
The majority is Apache Licensed, but components like PIDINet use their own licenses, requiring careful review for compatibility, especially for commercial use or integration into closed-source projects.
Limitations & Caveats
Newly released with ongoing development tasks (e.g., expanding VLM support). The mixed licensing, particularly PIDINet's terms, necessitates thorough vetting for adoption.