Image/video perception models and multimodal LLMs
This repository provides state-of-the-art Perception Encoders (PE) for image and video processing and Perception Language Models (PLM) for multimodal understanding. It targets researchers and developers needing high-performance vision and vision-language models, offering specialized checkpoints for various downstream tasks and enabling reproducible research with open datasets.
How It Works
The Perception Encoder (PE) is a family of vision encoders designed for scalable contrastive pretraining. It offers three specialized variants: PE-Core for general vision-language tasks, PE-Lang for multimodal LLMs, and PE-Spatial for dense prediction tasks. The Perception Language Model (PLM) combines PE-Lang with open-source LLMs (Llama variants) to achieve state-of-the-art performance on vision-language benchmarks, and the release also includes novel large-scale video datasets.
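To make the PE side concrete, below is a minimal zero-shot image-text scoring sketch in the style of the example usage in apps/pe/README.md. The module paths (core.vision_encoder.pe, core.vision_encoder.transforms), the config name "PE-Core-L14-336", and the helper functions are assumptions drawn from that README and may differ across releases; treat apps/pe/README.md as the authoritative reference.

```python
# Sketch: zero-shot image-text scoring with a PE-Core checkpoint.
# Module paths, config name, and helper names are assumptions; see apps/pe/README.md.
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Load a pretrained PE-Core model (weights are downloaded on first use).
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True)
model = model.cuda().eval()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).cuda()
text = tokenizer(["a photo of a dog", "a photo of a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```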
Quick Start & Requirements
Create a conda environment (conda create --name perception_models python=3.12), activate it, install PyTorch (pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124), install ffmpeg (conda install ffmpeg -c conda-forge), install torchcodec (pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124), and install the package (pip install -e .). See apps/pe/README.md and apps/plm/README.md for detailed setup.
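As a quick sanity check before moving on to the app-specific READMEs, the snippet below only assumes the installation steps above completed; it verifies the core dependencies import and that CUDA is visible.

```python
# Minimal environment check after installation (assumes the steps above succeeded).
import torch
import torchvision
import torchcodec  # video decoding backend used for video inputs

print("torch:", torch.__version__)               # expected 2.5.1
print("torchvision:", torchvision.__version__)   # expected 0.20.1
print("CUDA available:", torch.cuda.is_available())
```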
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats