perception_models by facebookresearch

Image/video perception models and multimodal LLMs

Created 5 months ago
1,620 stars

Top 25.9% on SourcePulse

View on GitHub
Project Summary

This repository provides state-of-the-art Perception Encoders (PE) for image and video processing and Perception Language Models (PLM) for multimodal understanding. It targets researchers and developers needing high-performance vision and vision-language models, offering specialized checkpoints for various downstream tasks and enabling reproducible research with open datasets.

How It Works

The Perception Encoder (PE) is a family of vision encoders designed for scalable contrastive pretraining. It offers three specialized variants: PE-Core for general vision-language tasks, PE-Lang for multimodal LLMs, and PE-Spatial for dense prediction tasks. The Perception Language Model (PLM) combines PE-Lang with open-source Llama LLMs to reach state-of-the-art performance on vision-language benchmarks, including newly introduced large-scale video benchmarks.
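
A minimal sketch of the wiring this section describes, assuming a PLM-style design in which patch features from a frozen vision encoder are projected into the LLM's embedding space and concatenated with text tokens. All class names, dimensions, and shapes below are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's token-embedding space.

    Hypothetical module for illustration; not the repository's implementation.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)


# Toy dimensions chosen for illustration only.
vision_dim, llm_dim = 1024, 4096
projector = VisionToLLMProjector(vision_dim, llm_dim)

patch_features = torch.randn(1, 256, vision_dim)  # stand-in for PE-Lang patch output
text_embeddings = torch.randn(1, 32, llm_dim)     # stand-in for LLM token embeddings

# The LLM then attends over projected image tokens followed by text tokens.
multimodal_sequence = torch.cat([projector(patch_features), text_embeddings], dim=1)
print(multimodal_sequence.shape)  # torch.Size([1, 288, 4096])
```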

Quick Start & Requirements

  • Install: clone the repo, then:
    1. Create and activate a conda environment: conda create --name perception_models python=3.12
    2. Install PyTorch and xformers: pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
    3. Install ffmpeg: conda install ffmpeg -c conda-forge
    4. Install torchcodec: pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
    5. Install the package in editable mode: pip install -e .
  • Prerequisites: Python 3.12, PyTorch 2.5.1, CUDA 12.4 (for GPU acceleration), ffmpeg.
  • Resources: Requires significant GPU resources for training and inference, especially for larger models.
  • Demo: A Colab demo is available for image/text feature extraction (a minimal usage sketch follows this list). See apps/pe/README.md and apps/plm/README.md for detailed setup.
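
A minimal sketch of the kind of image/text feature extraction the demo covers, written as a CLIP-style zero-shot probe. The module paths, config name, and helper functions are assumptions about the PE app's interface rather than confirmed API; check apps/pe/README.md before relying on them.

```python
import torch
from PIL import Image

# Module paths below are assumptions about the PE app's layout;
# check apps/pe/README.md for the actual imports and released configs.
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical config name; substitute one of the released PE-Core checkpoints.
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).to(device).eval()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.inference_mode():
    image_features, text_features, logit_scale = model(image, text)
    # Contrastively trained encoders: scaled cosine similarity, softmax over labels.
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```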

Highlighted Details

  • PE-Core outperforms SigLIP2 on image and InternVideo2 on video benchmarks.
  • PE-Lang competes with QwenVL2.5 and InternVL3 on multimodal LLM tasks.
  • PE-Spatial surpasses DINOv2 on dense prediction tasks.
  • PLM models are available in 1B, 3B, and 8B parameter sizes, trained on open data.

Maintenance & Community

  • Developed by Facebook Research.
  • Code structure and LLM implementation are forked from Meta Lingua.
  • Acknowledgements to Open_CLIP and CLIP_benchmark.
  • Links to evaluation and training documentation are provided.

Licensing & Compatibility

  • The README does not explicitly state a license. Given the project's Facebook Research origin and its emphasis on open-source contributions, a permissive license is plausible, but confirm the repository's LICENSE file and any per-model terms before commercial use.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial adoption.
  • Detailed setup and usage instructions are spread across multiple README files within subdirectories.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 3
  • Star History: 107 stars in the last 30 days
