perception_models by facebookresearch

Image/video perception models and multimodal LLMs

created 3 months ago
1,467 stars

Top 28.5% on sourcepulse

View on GitHub
Project Summary

This repository provides state-of-the-art Perception Encoders (PE) for image and video processing and Perception Language Models (PLM) for multimodal understanding. It targets researchers and developers needing high-performance vision and vision-language models, offering specialized checkpoints for various downstream tasks and enabling reproducible research with open datasets.

How It Works

The Perception Encoder (PE) is a family of vision encoders designed for scalable contrastive pretraining. It offers three specialized variants: PE-Core for general vision-language tasks, PE-Lang for multimodal LLMs, and PE-Spatial for dense prediction tasks. The Perception Language Model (PLM) combines PE-Lang with open-source LLMs (Llama variants) to achieve state-of-the-art performance on vision-language benchmarks, including newly introduced large-scale video benchmarks.
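To make this concrete, the sketch below loads a PE-Core checkpoint and scores image-text similarity, following the usage pattern documented in apps/pe/README.md. The module paths (core.vision_encoder.pe, core.vision_encoder.transforms), the config name PE-Core-L14-336, and the model call signature are assumptions taken from that pattern; verify them against the repository before relying on them.

```python
# Hypothetical sketch: zero-shot image/text scoring with a PE-Core checkpoint.
# Module paths, config name, and call signature follow the pattern in
# apps/pe/README.md but should be verified against the repository.
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Load a pretrained PE-Core model (weights are downloaded on first use).
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).cuda().eval()

# Preprocessing and tokenizer matched to this checkpoint's resolution/context.
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).cuda()  # placeholder path
text = tokenizer(["a photo of a dog", "a photo of a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```

PE-Lang and PE-Spatial checkpoints follow the same family naming; PLM inference is covered separately in apps/plm/README.md.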

Quick Start & Requirements

  • Install: Clone the repo, then run the following steps (a post-install sanity check is sketched after this list):
      – Create and activate a conda environment: conda create --name perception_models python=3.12, then conda activate perception_models
      – Install PyTorch with CUDA 12.4 wheels: pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
      – Install ffmpeg: conda install ffmpeg -c conda-forge
      – Install torchcodec: pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
      – Install the package in editable mode: pip install -e .
  • Prerequisites: Python 3.12, PyTorch 2.5.1, CUDA 12.4 (for GPU acceleration), ffmpeg.
  • Resources: Requires significant GPU resources for training and inference, especially for larger models.
  • Demo: A Colab demo is available for image/text feature extraction. See apps/pe/README.md and apps/plm/README.md for detailed setup.
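As noted in the Install step, a quick post-install sanity check can confirm that the pinned PyTorch build and the CUDA 12.4 runtime are visible. A minimal sketch, assuming the conda environment above is active and a CUDA-capable GPU is present:

```python
# Post-install sanity check: verify the pinned versions and GPU visibility.
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)              # expect 2.5.1+cu124
print("torchvision:", torchvision.__version__)  # expect 0.20.1+cu124
print("torchaudio:", torchaudio.__version__)    # expect 2.5.1+cu124
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```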

Highlighted Details

  • PE-Core outperforms SigLIP2 on image benchmarks and InternVideo2 on video benchmarks.
  • PE-Lang competes with Qwen2.5-VL and InternVL3 on multimodal LLM tasks.
  • PE-Spatial surpasses DINOv2 on dense prediction tasks.
  • PLM models are available in 1B, 3B, and 8B parameter sizes, trained on open data.

Maintenance & Community

  • Developed by Facebook Research.
  • Code structure and LLM implementation are forked from Meta Lingua.
  • Acknowledgements to Open_CLIP and CLIP_benchmark.
  • Links to evaluation and training documentation are provided.

Licensing & Compatibility

  • The README does not explicitly state a license. Check the repository's LICENSE file and clarify licensing terms before commercial use.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial adoption.
  • Detailed setup and usage instructions are spread across multiple README files within subdirectories.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 17

Star History

  • 740 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Douwe Kiela (cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • Vision-language research paper using LLMs
  • 352 stars; created 2 years ago, updated 1 week ago

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations
  • Open-source framework for training large multimodal models
  • 4k stars; created 2 years ago, updated 11 months ago