VLM-FO1 by om-ai-lab

Enhancing VLMs with fine-grained perception

Created 6 months ago
290 stars

Top 90.8% on SourcePulse

Project Summary

VLM-FO1 addresses the challenge of endowing Vision-Language Models (VLMs) with superior fine-grained perception capabilities without compromising their existing high-level reasoning. It offers a plug-and-play module for researchers and developers, enabling the creation of next-generation perception-aware models adept at tasks like object grounding, region understanding, and visual region reasoning.

How It Works

The framework uses a plug-and-play modular design, allowing seamless integration with any pre-trained VLM while preserving its original weights. At its core is the Hybrid Fine-grained Region Encoder (HFRE), a dual-vision-encoder architecture that fuses semantically rich features with perception-enhanced features, producing region tokens that capture both high-level meaning and fine-grained spatial detail. A two-stage training strategy instills fine-grained perception without catastrophic forgetting of the base model's general visual understanding.
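The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the feature dimensions, the concatenate-then-project fusion, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real model's sizes will differ.
D_SEM, D_PER, D_TOK = 1024, 256, 2048
N_REGIONS = 4

# Per-region features pooled from the two vision encoders:
# one semantic-rich stream and one perception-enhanced stream.
semantic_feats = rng.standard_normal((N_REGIONS, D_SEM))
perception_feats = rng.standard_normal((N_REGIONS, D_PER))

# Fusion sketch: concatenate the two streams per region, then
# linearly project into region tokens the language model can attend to.
W = rng.standard_normal((D_SEM + D_PER, D_TOK)) * 0.02
b = np.zeros(D_TOK)

fused = np.concatenate([semantic_feats, perception_feats], axis=-1)
region_tokens = fused @ W + b
print(region_tokens.shape)  # (4, 2048)
```

Because the base VLM's weights are untouched, only the encoder and projection above (plus the region tokens' interface) need to be trained, which is what makes the module plug-and-play.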

Quick Start & Requirements

  • Installation: Clone the repository, navigate to the directory, and install dependencies using pip install -r requirements.txt. Conda environment setup is also supported.
  • Prerequisites: Python 3.10+, PyTorch (GPU recommended, CUDA-enabled build), and Linux as the primary tested platform.
  • Demos: Multiple Gradio demos are available for inference, including integrations with SAM3 for segmentation and video tracking, and with UPN (or alternatives) for object proposal generation.
  • Links: Pre-trained checkpoints are available on Hugging Face. SAM3 setup requires following its official guide.
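The installation steps above amount to a short shell session; the repository URL and the conda environment name below are inferred from the project and organization names, so verify them against the README.

```shell
# Clone and enter the repository (URL inferred from project/org names).
git clone https://github.com/om-ai-lab/VLM-FO1.git
cd VLM-FO1

# Optional: isolated conda environment (name is illustrative).
conda create -n vlm-fo1 python=3.10 -y
conda activate vlm-fo1

# Install dependencies.
pip install -r requirements.txt
```

SAM3 is set up separately by following its official guide; the Gradio demos can then be launched per the repository's instructions.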

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance across a diverse suite of benchmarks, including COCO (detection mAP), CountBench, Pixmo-Count, HumanRef, LVIS, PACO, and COCOText.
  • Features a plug-and-play modularity that preserves base VLM capabilities.
  • Introduces a Hybrid Fine-grained Region Encoder (HFRE) for fusing semantic and fine-grained perceptual features.
  • Integrates with SAM3 for enhanced segmentation fidelity and reliable detection under compositional prompts, and with UPN (or similar detectors) for generating object proposals.
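The proposal-driven flow implied by the last bullet can be sketched as follows: a class-agnostic proposer (UPN or a similar detector) emits scored boxes, which are filtered and normalized before being handed to the VLM as region prompts. The function, field names, and `<region_k>` placeholder format are all illustrative assumptions, not the project's actual API.

```python
def select_region_prompts(proposals, image_size, score_thresh=0.3, max_regions=8):
    """Keep the highest-scoring proposals and normalize boxes to [0, 1]."""
    w, h = image_size
    kept = sorted(
        (p for p in proposals if p["score"] >= score_thresh),
        key=lambda p: p["score"],
        reverse=True,
    )[:max_regions]
    return [
        {
            "tag": f"<region_{i}>",  # placeholder token format is hypothetical
            "box": (p["box"][0] / w, p["box"][1] / h,
                    p["box"][2] / w, p["box"][3] / h),
        }
        for i, p in enumerate(kept)
    ]

# Example: three detector proposals on a 640x480 image.
proposals = [
    {"box": (10, 20, 200, 180), "score": 0.92},
    {"box": (300, 40, 420, 160), "score": 0.55},
    {"box": (5, 5, 30, 30), "score": 0.10},  # below threshold, dropped
]
regions = select_region_prompts(proposals, image_size=(640, 480))
print(len(regions))  # 2
```

In the SAM3 demos, the same normalized regions could additionally seed segmentation or video tracking; the VLM only ever sees the resulting region tokens.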

Maintenance & Community

The README does not provide specific details regarding notable contributors, sponsorships, partnerships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license type or provide compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The original UPN object detector referenced in the paper has not been publicly released due to company policy; users must integrate an alternative, such as ChatRex's UPN or another compatible detector of their own.
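Because the proposer is swappable, any replacement detector only needs to produce proposals in a consistent format. A minimal adapter sketch under that assumption (the protocol, method names, and dict schema here are all hypothetical, not the project's actual interface):

```python
from typing import Protocol


class ProposalGenerator(Protocol):
    """Interface a drop-in detector could satisfy (hypothetical)."""

    def propose(self, image) -> list[dict]:
        """Return [{'box': (x1, y1, x2, y2), 'score': float}, ...]."""
        ...


class DummyProposer:
    """Stand-in detector returning a fixed box, useful for testing
    the rest of the pipeline without UPN."""

    def propose(self, image) -> list[dict]:
        return [{"box": (0, 0, 100, 100), "score": 0.99}]


def run_pipeline(proposer: ProposalGenerator, image) -> list[dict]:
    # Any object satisfying the protocol can slot in here --
    # ChatRex's UPN behind a thin wrapper, or a custom detector.
    return proposer.propose(image)


result = run_pipeline(DummyProposer(), image=None)
print(result)
```

Wrapping ChatRex's UPN (or any detector) behind such an adapter keeps the rest of the pipeline unchanged when the proposer is swapped.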

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 49 stars in the last 30 days

