LongVA by EvolvingLMMs-Lab

Vision-language model for long context understanding

created 1 year ago
387 stars

Top 75.2% on sourcepulse

View on GitHub
Project Summary

LongVA is a multimodal large language model designed to process exceptionally long visual contexts, enabling zero-shot transfer of long-context capabilities from language to vision. It is suitable for researchers and practitioners working with extended video or image sequences, offering state-of-the-art performance on benchmarks like Video-MME.

How It Works

LongVA handles thousands of visual tokens by pairing a vision encoder with a long-context language model, transferring context length gained on text to the visual domain. Videos are uniformly sampled into frames (and images split into segments), each piece is encoded into visual tokens, and the resulting sequence is fed to the LLM. This allows processing of up to 2,000 frames, or over 200K visual tokens, while maintaining context over extended durations.
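The frame-sampling step described above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code; the midpoint-of-window strategy and the example frame counts are assumptions for the sketch:

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread uniformly across a video."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame at the midpoint of each of num_samples equal windows.
    return [int(step * i + step / 2) for i in range(num_samples)]

# e.g. sample 8 frames from a 30 fps, 1-minute clip (1800 frames)
indices = sample_frame_indices(1800, 8)
```

The selected indices would then be used to fetch frames from a video reader before tokenization.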

Quick Start & Requirements

  • Installation: Tested with CUDA 11.8 on an A100-SXM-80G GPU. Installation involves creating a Conda environment, installing PyTorch 2.1.2 with CUDA 11.8 support, and then installing the LongVA package with training dependencies. flash-attn v2.5.0 is also required.
  • Demo: Local CLI inference and Gradio UI demos are provided.
  • Hugging Face: Example code for image and video processing using the lmms-lab/LongVA-7B-DPO model is available.
  • Evaluation: Supports evaluation via lmms-eval for both image and video tasks.

Highlighted Details

  • Processes up to 2000 frames or over 200K visual tokens.
  • Achieves state-of-the-art performance on Video-MME among 7B models.
  • Offers zero-shot transfer of long-context capabilities from language to vision.
  • Training code for vision-text alignment is available.
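To put the 2,000-frame figure in perspective, a quick back-of-the-envelope token count helps. The 144-tokens-per-frame figure below is an assumption for illustration only; the actual per-frame count depends on the encoder and resolution configuration:

```python
def visual_token_count(num_frames: int, tokens_per_frame: int = 144) -> int:
    """Estimate total visual tokens for a sampled clip."""
    return num_frames * tokens_per_frame

# 2000 frames at an assumed 144 tokens each is 288,000 visual tokens,
# already past the 200K mark and far beyond a typical 4K-8K context window.
total = visual_token_count(2000)
```

This is why long-context capability in the language backbone, rather than aggressive frame subsampling, is the enabling factor here.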

Maintenance & Community

The project is actively developed, with recent updates including the release of training code for vision-text alignment. Further details and community interaction can be found via the provided blog and Hugging Face links.

Licensing & Compatibility

The project is released under a permissive license, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

The codebase is tested on specific hardware (CUDA 11.8, A100-SXM-80G), and performance may vary on different configurations. Processing very long sequences requires significant GPU memory.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days
