Research paper for parameter-free LLaVA extension to videos
PLLaVA extends existing image-language models to video data for tasks like video dense captioning, targeting researchers and developers. It offers a parameter-free approach to adapt image models for video, achieving state-of-the-art results on benchmarks like Video ChatGPT and MVBench by employing a novel temporal pooling strategy to mitigate feature saturation.
How It Works
PLLaVA addresses the computational and data demands of video-language pre-training by adapting image-language models. It introduces a simple pooling strategy that smooths feature distributions across the temporal dimension, reducing the impact of dominant "extreme tokens" in video frames. This parameter-free extension allows existing image models to be fine-tuned for video tasks more efficiently and effectively, particularly for captioning.
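The core idea can be illustrated with a short sketch. The snippet below applies adaptive average pooling over the temporal and spatial dimensions of per-frame visual tokens before they are handed to the language model; the tensor shapes, pooled output size, and function name are illustrative assumptions, not the repository's exact configuration.

```python
import torch
import torch.nn as nn

def pool_video_features(frame_feats, pooled_shape=(16, 12, 12)):
    """Pool per-frame visual tokens across time and space.

    frame_feats: (B, T, H, W, D) tokens from an image encoder applied frame by frame.
    pooled_shape: (T', H', W') target grid after pooling (illustrative values).
    """
    B, T, H, W, D = frame_feats.shape
    x = frame_feats.permute(0, 4, 1, 2, 3)                    # (B, D, T, H, W)
    x = nn.functional.adaptive_avg_pool3d(x, pooled_shape)    # smooth features over time/space
    x = x.permute(0, 2, 3, 4, 1)                              # (B, T', H', W', D)
    return x.flatten(1, 3)                                    # (B, T'*H'*W', D) tokens for the LLM

# Example: 16 frames of 24x24 patch tokens with feature dim 1024
feats = torch.randn(1, 16, 24, 24, 1024)
print(pool_video_features(feats).shape)  # torch.Size([1, 2304, 1024])
```

Because the pooling is a fixed averaging operation, it introduces no trainable parameters, and it caps the number of visual tokens passed to the LLM regardless of how many frames are sampled.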
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt` (after installing PyTorch with CUDA support).
- Download the base image-model weights (e.g., `llava-hf/llava-v1.6-vicuna-7b-hf`).
- Launch the demo with `bash scripts/demo.sh <model_dir> <weights_dir>`.
Highlighted Details
Built on the Hugging Face `transformers` and `accelerate` libraries.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats