Research paper for parameter-free LLaVA extension to videos
PLLaVA extends existing image-language models to video data for tasks like video dense captioning, targeting researchers and developers. It offers a parameter-free approach to adapt image models for video, achieving state-of-the-art results on benchmarks like Video ChatGPT and MVBench by employing a novel temporal pooling strategy to mitigate feature saturation.
How It Works
PLLaVA addresses the computational and data demands of video-language pre-training by adapting image-language models. It introduces a simple pooling strategy that smooths feature distributions across the temporal dimension, reducing the impact of dominant "extreme tokens" in video frames. This parameter-free extension allows existing image models to be fine-tuned for video tasks more efficiently and effectively, particularly for captioning.
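The core idea can be illustrated with a short sketch. The snippet below applies adaptive average pooling over the temporal and spatial dimensions of per-frame visual tokens before they are handed to the language model; the tensor shapes, pooled output size, and function name are illustrative assumptions, not the repository's exact configuration.

```python
import torch
import torch.nn as nn

def pool_video_features(frame_feats, pooled_shape=(16, 12, 12)):
    """Pool per-frame visual tokens across time and space.

    frame_feats: (B, T, H, W, D) tokens from an image encoder applied frame by frame.
    pooled_shape: (T', H', W') target grid after pooling (illustrative values).
    """
    B, T, H, W, D = frame_feats.shape
    x = frame_feats.permute(0, 4, 1, 2, 3)                    # (B, D, T, H, W)
    x = nn.functional.adaptive_avg_pool3d(x, pooled_shape)    # smooth features over time/space
    x = x.permute(0, 2, 3, 4, 1)                              # (B, T', H', W', D)
    return x.flatten(1, 3)                                    # (B, T'*H'*W', D) tokens for the LLM

# Example: 16 frames of 24x24 patch tokens with feature dim 1024
feats = torch.randn(1, 16, 24, 24, 1024)
print(pool_video_features(feats).shape)  # torch.Size([1, 2304, 1024])
```

Because the pooling is a fixed averaging operation, it introduces no trainable parameters, and it caps the number of visual tokens passed to the LLM regardless of how many frames are sampled.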
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt` (after installing PyTorch with CUDA support).
- Download the base image-model weights (e.g., `llava-hf/llava-v1.6-vicuna-7b-hf`).
- Launch the demo with `bash scripts/demo.sh <model_dir> <weights_dir>`.
Highlighted Details
Built on the Hugging Face `transformers` and `accelerate` libraries.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats