VLM survey paper with links to models/methods for vision tasks
This repository serves as a comprehensive survey of Vision-Language Models (VLMs) applied to various visual recognition tasks, including image classification, object detection, and semantic segmentation. It targets researchers and practitioners in computer vision and natural language processing, offering a structured overview of VLMs, their pre-training methods, transfer learning techniques, and knowledge distillation strategies. The project aims to consolidate and categorize the rapidly evolving field of VLMs for vision tasks.
How It Works
The repository is structured around a survey paper, "Vision-Language Models for Vision Tasks: A Survey," which systematically categorizes VLMs based on their application in visual recognition. It details pre-training methodologies (contrastive, generative, alignment), transfer learning approaches (prompt tuning, adapters), and knowledge distillation techniques. The survey also lists relevant datasets for both pre-training and evaluation across various vision tasks.
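Among the pre-training objectives the survey categorizes, the contrastive family (popularized by CLIP-style models) is the most common. The sketch below is an illustration of that idea, not code from this repository: a symmetric InfoNCE loss over a batch of paired image/text embeddings, written in NumPy with assumed function and parameter names.

```python
# Illustrative sketch of a CLIP-style symmetric contrastive loss.
# All names (clip_contrastive_loss, temperature) are assumptions for
# illustration; the surveyed papers define their own variants.
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched image/text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])  # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
imgs = rng.normal(size=(4, 8))   # toy image embeddings, batch of 4
txts = rng.normal(size=(4, 8))   # toy text embeddings, same batch
loss = clip_contrastive_loss(imgs, txts)
```

The key design point the survey highlights: both encoders are trained jointly so that matched image/text pairs score higher than all in-batch mismatches, which is what enables zero-shot transfer to recognition tasks.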
Quick Start & Requirements
This repository is a curated list of papers and links; there is nothing to install or run, and no specific software is required to browse it.
Maintenance & Community
The project is maintained by jingyi0000 and welcomes contributions via pull requests for missing papers. The last update was on March 24, 2025.
Licensing & Compatibility
The repository itself does not specify a license. The linked papers and code repositories will have their own respective licenses.
Limitations & Caveats
This repository is a survey and does not provide executable code or models. Its value is in its curated information and links to external resources.