perception_models by facebookresearch

Image/video perception models and multimodal LLMs

created 3 months ago
1,467 stars

Top 28.5% on sourcepulse

View on GitHub
Project Summary

This repository provides state-of-the-art Perception Encoders (PE) for image and video processing and Perception Language Models (PLM) for multimodal understanding. It targets researchers and developers needing high-performance vision and vision-language models, offering specialized checkpoints for various downstream tasks and enabling reproducible research with open datasets.

How It Works

The Perception Encoder (PE) is a family of vision encoders designed for scalable contrastive pretraining. It offers three specialized variants: PE-Core for general vision-language tasks, PE-Lang for multimodal LLMs, and PE-Spatial for dense prediction tasks. The Perception Language Model (PLM) combines PE-Lang with open-source LLMs (Llama variants) to achieve state-of-the-art performance on vision-language benchmarks, including newly introduced large-scale video benchmarks.
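To make this concrete, the sketch below loads a PE-Core checkpoint and scores image-text similarity, following the usage pattern documented in apps/pe/README.md. The module paths (core.vision_encoder.pe, core.vision_encoder.transforms), the config name PE-Core-L14-336, and the model call signature are assumptions taken from that pattern; verify them against the repository before relying on them.

```python
# Hypothetical sketch: zero-shot image/text scoring with a PE-Core checkpoint.
# Module paths, config name, and call signature follow the pattern in
# apps/pe/README.md but should be verified against the repository.
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Load a pretrained PE-Core model (weights are downloaded on first use).
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).cuda().eval()

# Preprocessing and tokenizer matched to this checkpoint's resolution/context.
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).cuda()  # placeholder path
text = tokenizer(["a photo of a dog", "a photo of a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```

PE-Lang and PE-Spatial checkpoints follow the same family naming; PLM inference is covered separately in apps/plm/README.md.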

Quick Start & Requirements

  • Install: Clone the repo, then run the following steps (a post-install sanity check is sketched after this list):
      – Create and activate a conda environment: conda create --name perception_models python=3.12, then conda activate perception_models
      – Install PyTorch with CUDA 12.4 wheels: pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
      – Install ffmpeg: conda install ffmpeg -c conda-forge
      – Install torchcodec: pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
      – Install the package in editable mode: pip install -e .
  • Prerequisites: Python 3.12, PyTorch 2.5.1, CUDA 12.4 (for GPU acceleration), ffmpeg.
  • Resources: Requires significant GPU resources for training and inference, especially for larger models.
  • Demo: A Colab demo is available for image/text feature extraction. See apps/pe/README.md and apps/plm/README.md for detailed setup.
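As noted in the Install step, a quick post-install sanity check can confirm that the pinned PyTorch build and the CUDA 12.4 runtime are visible. A minimal sketch, assuming the conda environment above is active and a CUDA-capable GPU is present:

```python
# Post-install sanity check: verify the pinned versions and GPU visibility.
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)              # expect 2.5.1+cu124
print("torchvision:", torchvision.__version__)  # expect 0.20.1+cu124
print("torchaudio:", torchaudio.__version__)    # expect 2.5.1+cu124
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```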

Highlighted Details

  • PE-Core outperforms SigLIP2 on image benchmarks and InternVideo2 on video benchmarks.
  • PE-Lang competes with Qwen2.5-VL and InternVL3 on multimodal LLM tasks.
  • PE-Spatial surpasses DINOv2 on dense prediction tasks.
  • PLM models are available in 1B, 3B, and 8B parameter sizes, trained on open data.

Maintenance & Community

  • Developed by Facebook Research.
  • Code structure and LLM implementation are forked from Meta Lingua.
  • Acknowledgements to Open_CLIP and CLIP_benchmark.
  • Links to evaluation and training documentation are provided.

Licensing & Compatibility

  • The README does not explicitly state a license. Check the repository's LICENSE file and clarify licensing terms before commercial use.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial adoption.
  • Detailed setup and usage instructions are spread across multiple README files within subdirectories.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 17

Star History

  • 740 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Douwe Kiela (cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • Vision-language research paper using LLMs
  • 352 stars; created 2 years ago, updated 1 week ago

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations
  • Open-source framework for training large multimodal models
  • 4k stars; created 2 years ago, updated 11 months ago