CoVT by Wakals

Enabling VLMs to reason in continuous visual space

Created 2 months ago
252 stars

Top 99.6% on SourcePulse

Project Summary

Chain-of-Visual-Thought (CoVT) enhances Vision-Language Models (VLMs) by enabling them to reason in continuous visual space, addressing their weaknesses in perceptual understanding. It introduces "continuous visual tokens" that encode rich perceptual cues, bridging language and vision for improved fine-grained understanding, spatial precision, and geometric awareness.

How It Works

CoVT equips VLMs with compact latent "continuous visual tokens" derived from lightweight vision experts (e.g., SAM, DepthAnything). During training, the VLM autoregressively predicts these tokens so that they can reconstruct dense perceptual signals (segmentation, depth, edges). At inference, reasoning happens directly in this compact latent token space, which keeps compute costs low while yielding more precise, grounded, and interpretable multimodal reasoning.
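As a rough illustration of the training objective described above, here is a minimal PyTorch sketch. The names (VisualTokenHead, teacher_tokens) and shapes are illustrative assumptions, not the project's actual API; the random teacher tensor stands in for tokens distilled from the vision experts.

```python
# Hedged sketch of the CoVT training idea; names and shapes are assumptions.
import torch
import torch.nn as nn

class VisualTokenHead(nn.Module):
    """Maps VLM hidden states to a small budget (~20) of continuous visual tokens."""
    def __init__(self, hidden_dim: int, token_dim: int, num_tokens: int = 20):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(hidden_dim, token_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Treat the final `num_tokens` positions as the latent "visual thought".
        return self.proj(hidden_states[:, -self.num_tokens:, :])

# Toy shapes: batch of 2, sequence of 128 VLM hidden states of width 1024.
hidden_states = torch.randn(2, 128, 1024)
head = VisualTokenHead(hidden_dim=1024, token_dim=256, num_tokens=20)
pred_tokens = head(hidden_states)  # (2, 20, 256)

# Teacher tokens would come from lightweight vision experts (SAM, DepthAnything,
# DINO, an edge detector); a random tensor stands in for them here.
teacher_tokens = torch.randn(2, 20, 256)
loss = nn.functional.mse_loss(pred_tokens, teacher_tokens)
loss.backward()
```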

Quick Start & Requirements

Evaluation uses a forked VLMEvalKit, and an interactive Gradio demo is available. Training data, code, and finetuned weights have been released. Setup involves integrating with a base VLM (e.g., Qwen2.5-VL, LLaVA) and will likely require GPU resources; see the loading sketch after the links below.

  • Project Page: https://wakalsprojectpage.covt-website/
  • Hugging Face Collection: https://huggingface.co/collections/Wakals/covt-chain-of-visual-thought
  • Arxiv: https://arxiv.org/abs/2511.19418
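For loading a released checkpoint, a hedged sketch using the standard Hugging Face transformers API follows. The model id is hypothetical, so substitute a real one from the collection linked above; the exact processor inputs may also differ per base VLM.

```python
# Hedged quick-start sketch; the checkpoint id below is a placeholder assumption.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "Wakals/CoVT-Qwen2.5-VL-7B"  # hypothetical id; check the HF collection
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

image = Image.open("example.jpg")
inputs = processor(
    images=image, text="How many chairs are in the image?", return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```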

Highlighted Details

CoVT consistently improves VLM performance by 3-16% across more than ten perception benchmarks (e.g., CV-Bench, MMVP) when integrated with models such as Qwen2.5-VL and LLaVA. Each visual token type (segment, depth, DINO, edge) shows task-specific effectiveness. The framework uses a small budget of ~20 tokens per visual thought, which can optionally be decoded into dense, interpretable predictions (sketched below).
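To illustrate the optional decoding of the token budget into a dense map, here is a toy PyTorch decoder. It is not CoVT's actual decoder, just a self-contained assumption-laden example of mapping ~20 continuous tokens to a 64x64 grid via cross-attention.

```python
# Illustrative only: a tiny decoder from continuous tokens to a dense "depth" map.
import torch
import torch.nn as nn

class TokenToMapDecoder(nn.Module):
    def __init__(self, token_dim: int = 256, num_tokens: int = 20, out_hw: int = 64):
        super().__init__()
        self.out_hw = out_hw
        # One learnable query per output pixel attends over the visual-thought tokens.
        self.queries = nn.Parameter(torch.randn(out_hw * out_hw, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(token_dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b = tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.attn(q, tokens, tokens)
        return self.head(attended).view(b, 1, self.out_hw, self.out_hw)

tokens = torch.randn(2, 20, 256)      # predicted continuous visual tokens
depth = TokenToMapDecoder()(tokens)   # (2, 1, 64, 64) dense prediction
```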

Maintenance & Community

Key releases (data, code, evaluation, weights) landed on 2025-11-24. Future work includes Hugging Face demos and support for more VLMs. Contact: ymk4474@gmail.com, xdwang@eecs.berkeley.edu; GitHub issues are recommended for implementation problems. No explicit community channels are listed.

Licensing & Compatibility

Most of the codebase is Apache licensed, but components such as PIDINet carry their own licenses; review them carefully for compatibility, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

The project is newly released, with development still ongoing (e.g., expanding VLM support). The mixed licensing, particularly PIDINet's terms, requires thorough vetting before adoption.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 69 stars in the last 30 days
