VLM-FO1 by om-ai-lab

Enhancing VLMs with fine-grained perception

Created 6 months ago
290 stars

Top 90.8% on SourcePulse

Project Summary

VLM-FO1 addresses the challenge of endowing Vision-Language Models (VLMs) with superior fine-grained perception capabilities without compromising their existing high-level reasoning. It offers a plug-and-play module for researchers and developers, enabling the creation of next-generation perception-aware models adept at tasks like object grounding, region understanding, and visual region reasoning.

How It Works

The framework uses a plug-and-play modular design, allowing seamless integration with any pre-trained VLM while preserving its original weights. At its core is the Hybrid Fine-grained Region Encoder (HFRE), a dual-vision-encoder architecture that fuses semantically rich features with perception-enhanced features, producing region tokens that capture both high-level meaning and fine-grained spatial detail. A two-stage training strategy instills fine-grained perception without catastrophic forgetting of the base model's general visual understanding.
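The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the feature dimensions, the concatenate-then-project fusion, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real model's sizes will differ.
D_SEM, D_PER, D_TOK = 1024, 256, 2048
N_REGIONS = 4

# Per-region features pooled from the two vision encoders:
# one semantic-rich stream and one perception-enhanced stream.
semantic_feats = rng.standard_normal((N_REGIONS, D_SEM))
perception_feats = rng.standard_normal((N_REGIONS, D_PER))

# Fusion sketch: concatenate the two streams per region, then
# linearly project into region tokens the language model can attend to.
W = rng.standard_normal((D_SEM + D_PER, D_TOK)) * 0.02
b = np.zeros(D_TOK)

fused = np.concatenate([semantic_feats, perception_feats], axis=-1)
region_tokens = fused @ W + b
print(region_tokens.shape)  # (4, 2048)
```

Because the base VLM's weights are untouched, only the encoder and projection above (plus the region tokens' interface) need to be trained, which is what makes the module plug-and-play.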

Quick Start & Requirements

  • Installation: Clone the repository, navigate to the directory, and install dependencies using pip install -r requirements.txt. Conda environment setup is also supported.
  • Prerequisites: Python 3.10+, PyTorch (GPU recommended, CUDA-enabled build), and Linux as the primary tested platform.
  • Demos: Multiple Gradio demos are available for inference, including integrations with SAM3 for segmentation and video tracking, and with UPN (or alternatives) for object proposal generation.
  • Links: Pre-trained checkpoints are available on Hugging Face. SAM3 setup requires following its official guide.
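The installation steps above amount to a short shell session; the repository URL and the conda environment name below are inferred from the project and organization names, so verify them against the README.

```shell
# Clone and enter the repository (URL inferred from project/org names).
git clone https://github.com/om-ai-lab/VLM-FO1.git
cd VLM-FO1

# Optional: isolated conda environment (name is illustrative).
conda create -n vlm-fo1 python=3.10 -y
conda activate vlm-fo1

# Install dependencies.
pip install -r requirements.txt
```

SAM3 is set up separately by following its official guide; the Gradio demos can then be launched per the repository's instructions.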

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance across a diverse suite of benchmarks, including COCO (detection mAP), CountBench, Pixmo-Count, HumanRef, LVIS, PACO, and COCOText.
  • Features a plug-and-play modularity that preserves base VLM capabilities.
  • Introduces a Hybrid Fine-grained Region Encoder (HFRE) for fusing semantic and fine-grained perceptual features.
  • Integrates with SAM3 for enhanced segmentation fidelity and reliable detection under compositional prompts, and with UPN (or similar detectors) for generating object proposals.
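The proposal-driven flow implied by the last bullet can be sketched as follows: a class-agnostic proposer (UPN or a similar detector) emits scored boxes, which are filtered and normalized before being handed to the VLM as region prompts. The function, field names, and `<region_k>` placeholder format are all illustrative assumptions, not the project's actual API.

```python
def select_region_prompts(proposals, image_size, score_thresh=0.3, max_regions=8):
    """Keep the highest-scoring proposals and normalize boxes to [0, 1]."""
    w, h = image_size
    kept = sorted(
        (p for p in proposals if p["score"] >= score_thresh),
        key=lambda p: p["score"],
        reverse=True,
    )[:max_regions]
    return [
        {
            "tag": f"<region_{i}>",  # placeholder token format is hypothetical
            "box": (p["box"][0] / w, p["box"][1] / h,
                    p["box"][2] / w, p["box"][3] / h),
        }
        for i, p in enumerate(kept)
    ]

# Example: three detector proposals on a 640x480 image.
proposals = [
    {"box": (10, 20, 200, 180), "score": 0.92},
    {"box": (300, 40, 420, 160), "score": 0.55},
    {"box": (5, 5, 30, 30), "score": 0.10},  # below threshold, dropped
]
regions = select_region_prompts(proposals, image_size=(640, 480))
print(len(regions))  # 2
```

In the SAM3 demos, the same normalized regions could additionally seed segmentation or video tracking; the VLM only ever sees the resulting region tokens.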

Maintenance & Community

The README does not provide specific details regarding notable contributors, sponsorships, partnerships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license type or provide compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The original UPN object detector referenced in the paper has not been publicly released due to company policy; users must integrate an alternative, such as ChatRex's UPN or another compatible detector of their own.
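Because the proposer is swappable, any replacement detector only needs to produce proposals in a consistent format. A minimal adapter sketch under that assumption (the protocol, method names, and dict schema here are all hypothetical, not the project's actual interface):

```python
from typing import Protocol


class ProposalGenerator(Protocol):
    """Interface a drop-in detector could satisfy (hypothetical)."""

    def propose(self, image) -> list[dict]:
        """Return [{'box': (x1, y1, x2, y2), 'score': float}, ...]."""
        ...


class DummyProposer:
    """Stand-in detector returning a fixed box, useful for testing
    the rest of the pipeline without UPN."""

    def propose(self, image) -> list[dict]:
        return [{"box": (0, 0, 100, 100), "score": 0.99}]


def run_pipeline(proposer: ProposalGenerator, image) -> list[dict]:
    # Any object satisfying the protocol can slot in here --
    # ChatRex's UPN behind a thin wrapper, or a custom detector.
    return proposer.propose(image)


result = run_pipeline(DummyProposer(), image=None)
print(result)
```

Wrapping ChatRex's UPN (or any detector) behind such an adapter keeps the rest of the pipeline unchanged when the proposer is swapped.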

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 49 stars in the last 30 days

