Vision-language survey paper with curated list of foundational CV models
This repository serves as a curated list of foundational models in computer vision, supplementing a survey paper on the topic. It aims to give researchers and practitioners a comprehensive overview of emerging vision models that leverage multimodal data and large-scale training for improved reasoning, generalization, and prompting capabilities.
How It Works
The repository organizes foundational models by their architectural designs, training objectives (contrastive or generative), pre-training datasets, and prompting patterns (textual, visual, heterogeneous). It highlights models that bridge modalities such as vision, text, and audio, enabling capabilities like zero-shot learning and prompt-based manipulation of visual outputs.
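To illustrate what "zero-shot learning via textual prompts" means for the contrastive models surveyed here, the sketch below classifies an image by comparing it against natural-language prompts with a CLIP-style model loaded through the Hugging Face transformers library. The checkpoint name, image path, and class prompts are illustrative assumptions, not recommendations of any specific entry in the list.

    # Minimal sketch: zero-shot image classification with a contrastive
    # vision-language model (CLIP-style). Checkpoint, image path, and
    # prompts are illustrative assumptions, not part of this repository.
    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")                   # any local image
    prompts = ["a photo of a cat", "a photo of a dog"]  # text prompts define the classes

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity scores; softmax gives per-prompt probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(prompts, probs[0].tolist())))

Swapping the prompt list changes the label set without any retraining, which is the prompt-driven behavior these foundational models are curated for.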
Quick Start & Requirements
This repository is a collection of links and information about foundational models, not a runnable codebase itself. Users are directed to individual project pages for installation and usage instructions.
Maintenance & Community
The repository is associated with a survey paper accepted for publication in IEEE TPAMI. Contributions of relevant new works are encouraged via pull requests.
Licensing & Compatibility
The licensing of individual models linked within this repository varies. Users should consult the specific licenses of each project.
Limitations & Caveats
This repository is a curated list and does not provide direct code execution or support. Users must refer to the individual project pages for model-specific details and functionality.