Vision-language model for long-context understanding
LongVA is a multimodal large language model designed to process exceptionally long visual contexts, enabling zero-shot transfer of long-context capabilities from language to vision. It is suitable for researchers and practitioners working with extended video or image sequences, offering state-of-the-art performance on benchmarks like Video-MME.
How It Works
LongVA pairs a vision encoder with a long-context large language model to handle thousands of visual tokens. It processes long inputs by sampling frames from videos (or splitting images into segments), encoding them into visual tokens, and feeding the resulting sequence to the LLM. This allows it to process up to 2,000 frames, or more than 200K visual tokens, while maintaining context over extended durations.
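The frame-sampling step can be illustrated with a short sketch. The sampling below uses OpenCV and is only a minimal illustration; `visual_tokenize` and `llm_generate` are hypothetical placeholders for LongVA's vision encoder and language model, not functions from the repository.

```python
# Minimal sketch of uniform frame sampling from a long video, assuming OpenCV.
import cv2
import numpy as np


def sample_frames(video_path: str, num_frames: int = 128) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# Hypothetical downstream usage: each sampled frame is encoded into visual
# tokens and concatenated with the text prompt before LLM decoding.
# frames = sample_frames("demo.mp4", num_frames=2000)
# visual_tokens = [visual_tokenize(f) for f in frames]   # placeholder
# answer = llm_generate(prompt, visual_tokens)            # placeholder
```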
Quick Start & Requirements
The lmms-lab/LongVA-7B-DPO model is available on Hugging Face, and evaluation is supported through lmms-eval for both image and video tasks.
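As a rough quick-start sketch, the released checkpoint can be fetched with the huggingface_hub client; the final loading call is a hypothetical placeholder, since the exact loader depends on the LongVA codebase.

```python
# Fetch the DPO-tuned 7B checkpoint referenced above (huggingface_hub client).
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="lmms-lab/LongVA-7B-DPO")
print(f"Checkpoint downloaded to: {ckpt_dir}")

# Hypothetical follow-up: load with the repository's own utilities, e.g.
# tokenizer, model, image_processor = load_longva(ckpt_dir)   # placeholder
```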
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates including the release of training code for vision-text alignment. Further details and community interaction can be found via the provided blog and Hugging Face links.
Licensing & Compatibility
The project is released under a permissive license, allowing for commercial use and integration with closed-source applications.
Limitations & Caveats
The codebase is tested on a specific setup (CUDA 11.8, A100-SXM-80G GPUs), and performance may vary on other configurations. Processing very long sequences requires significant GPU memory.
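As a rough illustration of why memory becomes a bottleneck, the sketch below estimates KV-cache size for the visual tokens alone; every number in it (tokens per frame, layer count, KV heads, head dimension, precision) is an illustrative assumption rather than a measured figure for LongVA.

```python
# Back-of-the-envelope estimate of KV-cache memory for long visual contexts.
# All parameter defaults below are illustrative assumptions.

def kv_cache_gib(num_frames: int,
                 tokens_per_frame: int = 144,   # assumed visual tokens per frame
                 num_layers: int = 28,          # assumed decoder layers
                 num_kv_heads: int = 4,         # assumed GQA key/value heads
                 head_dim: int = 128,           # assumed per-head dimension
                 bytes_per_value: int = 2) -> float:  # fp16/bf16
    """Estimate KV-cache size in GiB for the visual tokens alone."""
    tokens = num_frames * tokens_per_frame
    # Keys and values are cached for every layer and every KV head.
    total_bytes = tokens * num_layers * num_kv_heads * head_dim * 2 * bytes_per_value
    return total_bytes / (1024 ** 3)


for frames in (128, 1000, 2000):
    print(f"{frames:>5} frames -> ~{kv_cache_gib(frames):.1f} GiB of KV cache")
```

Under these assumptions the cache alone grows from roughly 1 GiB at 128 frames to over 15 GiB at 2,000 frames, before counting model weights and activations.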