Research paper for long-term video understanding using a memory-augmented large multimodal model
MA-LMM addresses long-term video understanding by introducing a memory-augmented large multimodal model. It targets researchers and practitioners in video analysis, offering a plug-and-play module that enhances zero-shot video understanding tasks such as question answering and summarization.
How It Works
MA-LMM integrates a memory bank into a large multimodal model (LMM), specifically building upon InstructBLIP. Frames are processed sequentially and their features are stored in the memory bank, so the model can retrieve relevant information from long video sequences and maintain context over extended durations. The core innovation is a memory bank compression scheme that caps the bank at a fixed length by averaging the most similar adjacent frame features, allowing effective long-term temporal reasoning without requiring extensive fine-tuning.
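A minimal sketch of this compression idea is shown below, simplified to one feature vector per frame; the function name, tensor shapes, and maximum length are illustrative and not the repository's API:

```python
import torch
import torch.nn.functional as F

def compress_memory_bank(memory: torch.Tensor, max_len: int) -> torch.Tensor:
    """Illustrative memory-bank compression (a sketch, not the repo's code).

    memory: [T, D] tensor of per-frame features appended over time.
    Whenever the bank grows past `max_len`, the two most similar
    temporally adjacent entries are averaged into one.
    """
    while memory.size(0) > max_len:
        # Cosine similarity between each pair of adjacent features: [T-1]
        sim = F.cosine_similarity(memory[:-1], memory[1:], dim=-1)
        i = int(sim.argmax())
        # Merge the most redundant adjacent pair by averaging.
        merged = (memory[i] + memory[i + 1]) / 2
        memory = torch.cat([memory[:i], merged.unsqueeze(0), memory[i + 2:]], dim=0)
    return memory

# Example: a 20-frame bank of 256-d features kept at 10 slots.
bank = compress_memory_bank(torch.randn(20, 256), max_len=10)
print(bank.shape)  # torch.Size([10, 256])
```

Merging only adjacent entries preserves the temporal order of the bank, which is what keeps long-range reasoning possible at a bounded memory cost.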
Quick Start & Requirements
Install with `pip install -e .` (after `git clone` and `cd MA-LMM`). Dependencies include `contexttimer`, `eva-decord` (or `decord` for non-Apple Silicon), `einops>=0.4.1`, and `fairscale==0.4.4`. Requires Vicuna-v1.1 LLM weights. Datasets (LVU, Breakfast, COIN, MSRVTT, MSVD, ActivityNet, Youcook2) need to be downloaded and frames extracted (e.g., using `extract_frames.py`).
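The repository's `extract_frames.py` handles frame extraction; as a rough illustration of what such preprocessing involves, here is a hedged sketch using `decord` (the function name, sampling strategy, and output layout are assumptions, not the script's actual behavior):

```python
import os
from decord import VideoReader, cpu
from PIL import Image

def extract_frames(video_path: str, out_dir: str, num_frames: int = 100) -> None:
    """Illustrative frame extraction (a sketch; see the repo's extract_frames.py)."""
    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path, ctx=cpu(0))
    # Uniformly sample up to `num_frames` indices across the whole video.
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    for i, idx in enumerate(indices):
        frame = vr[idx].asnumpy()  # H x W x 3 uint8, RGB
        Image.fromarray(frame).save(os.path.join(out_dir, f"frame_{i:05d}.jpg"))

# Example usage (hypothetical paths):
# extract_frames("videos/example.mp4", "frames/example", num_frames=100)
```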
Highlighted Details
Maintenance & Community
Last updated: 1 year ago. Status: inactive.
Licensing & Compatibility
Limitations & Caveats