facebookresearch
Image/video perception models and multimodal LLMs
Top 24.7% on SourcePulse
This repository provides state-of-the-art Perception Encoders (PE) for image and video processing and Perception Language Models (PLM) for multimodal understanding. It targets researchers and developers needing high-performance vision and vision-language models, offering specialized checkpoints for various downstream tasks and enabling reproducible research with open datasets.
How It Works
The Perception Encoder (PE) is a family of vision encoders designed for scalable contrastive pretraining. It offers three specialized variants: PE-Core for general vision-language tasks, PE-Lang for multimodal LLMs, and PE-Spatial for dense prediction tasks. The Perception Language Model (PLM) leverages PE-Lang and open-source LLMs (Llama variants) to achieve state-of-the-art performance on vision-language benchmarks, including novel large-scale video datasets.
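As a sketch of how a PE-Core checkpoint is typically used for zero-shot image-text scoring: the module paths and helper names below (core.vision_encoder.pe, pe.CLIP.from_config, transforms.get_image_transform, transforms.get_text_tokenizer) follow the upstream PE README and may differ between releases, so treat this as illustrative and check apps/pe/README.md for the exact interface.

```python
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Load a PE-Core checkpoint; the config name is assumed from the PE README
# and pretrained weights are fetched on first use.
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).cuda().eval()

# Image and text preprocessing matched to the loaded model.
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

# Contrastive scoring: softmax over scaled image-text similarities.
with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```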
Quick Start & Requirements
Create a conda environment (conda create --name perception_models python=3.12) and activate it. Install PyTorch and xformers from the CUDA 12.4 wheel index (pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124), ffmpeg (conda install ffmpeg -c conda-forge), and torchcodec (pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124). Finally, install the package in editable mode (pip install -e .). See apps/pe/README.md and apps/plm/README.md for detailed setup instructions.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats