Long-context visual language model for million-token processing
Long-VITA is a large multi-modal model designed to process extremely long contexts, exceeding one million tokens, for both image and video understanding. It targets researchers and developers working with extensive visual data and reports state-of-the-art results on benchmarks such as Video-MME among models under 20B parameters.
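For intuition about what a million-token visual context implies, here is a rough back-of-envelope calculation. The tokens-per-frame and text-reserve figures are illustrative assumptions, not numbers published by the Long-VITA project.

```python
# Back-of-envelope: how many sampled video frames fit in a ~1M-token context.
# TOKENS_PER_FRAME and TEXT_RESERVE are illustrative assumptions, not figures
# taken from the Long-VITA README.

CONTEXT_BUDGET = 1_000_000    # advertised visual-token context length
TOKENS_PER_FRAME = 256        # assumed visual tokens per sampled frame
TEXT_RESERVE = 8_192          # assumed tokens reserved for the text prompt

frames = (CONTEXT_BUDGET - TEXT_RESERVE) // TOKENS_PER_FRAME
hours_at_1fps = frames / 3600.0

print(f"~{frames} frames fit in the budget")
print(f"~{hours_at_1fps:.1f} hours of video at 1 frame per second")
```

Under these assumptions, the budget covers a few thousand frames, i.e. roughly an hour of video sampled at one frame per second.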
How It Works
Long-VITA scales to contexts of more than one million visual tokens, although the README does not detail the specific architectural changes that make this possible. The model is trained on 17 million publicly available samples, with a focus on open-source data. A Logits-Masked LM Head is highlighted as a key component of its effectiveness.
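The README does not spell out how the Logits-Masked LM Head works. One common way to implement such a head is to compute vocabulary logits only at positions that actually need them (for example, text positions contributing to the loss), which keeps the logits tensor small when most of a million-token sequence is visual. The sketch below is a minimal illustration under that assumption; the class name `MaskedLMHead` and its interface are hypothetical, not taken from the Long-VITA code.

```python
import torch
import torch.nn as nn

class MaskedLMHead(nn.Module):
    """Hypothetical sketch: project hidden states to vocabulary logits only
    at masked-in positions, avoiding a full [batch, seq, vocab] tensor when
    most of the sequence (e.g. visual tokens) never needs logits."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, logit_mask: torch.Tensor):
        # hidden_states: [batch, seq, hidden]; logit_mask: [batch, seq] (bool),
        # True where logits (and loss) are actually required.
        selected = hidden_states[logit_mask]   # [num_selected, hidden]
        return self.proj(selected)             # [num_selected, vocab]

# Toy usage: only 8 text positions out of a 4096-token (mostly visual) sequence.
head = MaskedLMHead(hidden_size=64, vocab_size=32000)
h = torch.randn(1, 4096, 64)
mask = torch.zeros(1, 4096, dtype=torch.bool)
mask[0, -8:] = True                            # only the text tail needs logits
logits = head(h, mask)
print(logits.shape)                            # torch.Size([8, 32000])
```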
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project has recently added an online demo and support in VLMEvalKit (OpenCompass). Training and inference code, logs, deployment code, and model weights are released.
Licensing & Compatibility
The README does not explicitly state the license type or any compatibility notes for commercial use.
Limitations & Caveats
The project primarily targets Ascend NPU and NVIDIA GPU hardware, with framework support for MindSpeed, Megatron, and DeepSpeed. Compatibility with other hardware or frameworks is not documented.
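As a small illustration of this hardware split, the snippet below checks whether an NVIDIA GPU or an Ascend NPU backend is available in a PyTorch environment. The `torch_npu` import follows the Ascend PyTorch adapter convention and is an assumption about the reader's setup; this is not code from the Long-VITA repository.

```python
import torch

def detect_backend() -> str:
    """Best-effort check for the accelerators Long-VITA targets.
    Illustrative sketch only, not code from the project."""
    if torch.cuda.is_available():
        return f"NVIDIA GPU ({torch.cuda.get_device_name(0)})"
    try:
        import torch_npu  # noqa: F401  (Ascend PyTorch adapter, assumed installed on NPU hosts)
        if torch.npu.is_available():
            return "Ascend NPU"
    except ImportError:
        pass
    return "CPU only (neither CUDA nor Ascend NPU detected)"

if __name__ == "__main__":
    print(detect_backend())
```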