Vision-Language-Models-Overview by zli12321

Survey of Vision-Language Models (VLMs)

Created 1 year ago

493 stars

Top 62.8% on SourcePulse

Project Summary

This repository provides a comprehensive, curated survey of Vision-Language Models (VLMs), covering state-of-the-art models, benchmarks, post-training techniques, applications, and challenges. It consolidates information on VLMs for researchers and practitioners and gives a structured overview of a rapidly evolving field.

How It Works

The project is a curated knowledge base that organizes links to papers, GitHub repositories, and datasets. It categorizes VLMs by architecture and training data, lists evaluation benchmarks with their metrics and sources, and details post-training methods such as RL alignment and prompt engineering. This structure makes it easier to navigate VLM research and development.
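As a rough illustration of the categorization described above, the sketch below models a few catalog rows and groups them by architecture. It is not taken from the repository, which contains only links and tables. The class name `ModelEntry`, its fields, the function `group_by_architecture`, and the placeholder model names and URLs are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelEntry:
    """One hypothetical row of a survey table; every field name is an assumption."""
    name: str
    architecture: str
    paper_url: str
    code_url: Optional[str] = None  # not every entry links to a code repository


def group_by_architecture(entries: List[ModelEntry]) -> Dict[str, List[str]]:
    """Map each architecture label to the model names listed under it, in input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        groups[entry.architecture].append(entry.name)
    return dict(groups)


# Placeholder entries only; these are not real models or real links.
entries = [
    ModelEntry("example-vlm-a", "decoder-only", "https://example.org/paper-a"),
    ModelEntry("example-vlm-b", "encoder-decoder", "https://example.org/paper-b"),
    ModelEntry("example-vlm-c", "decoder-only", "https://example.org/paper-c",
               code_url="https://example.org/code-c"),
]

catalog = group_by_architecture(entries)
```

The same grouping idea would apply to benchmarks (keyed by task or metric) if one wanted to mirror the survey's other tables in code.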

Quick Start & Requirements

This repository is a collection of links and does not require installation or execution. It serves as a reference guide.

Highlighted Details

Comprehensive tables detail over 30 state-of-the-art VLMs, including their architectures, training data, and parameter counts.
An extensive list of over 50 benchmark datasets and simulators covers diverse VLM evaluation tasks, from visual reasoning to embodied AI.
Detailed sections on post-training methods highlight Reinforcement Learning (RL) alignment techniques and prompt engineering strategies.
Applications are categorized across robotics, embodied AI, generative visual media, and human-centered AI, showcasing real-world VLM use cases.

Maintenance & Community

The repository is actively maintained. Papers marked with a star are contributions from the maintainers. Users are encouraged to contribute and to discuss the survey through the GitHub repository.

Licensing & Compatibility

The repository itself is a collection of links to external resources, each with its own licensing. The project does not impose specific licensing restrictions beyond those of the linked content.

Limitations & Caveats

As a survey, this repository does not provide executable code or models. The information is a snapshot of the field, and the rapid pace of VLM development means some details may become outdated.

Health Check

Last Commit

3 days ago

Responsiveness

Inactive

Star History

27 stars in the last 30 days