Multimodal model for image, video, and 3D understanding
Top 12.3% on sourcepulse
LLaVA-NeXT is an open-source project providing advanced Large Multimodal Models (LMMs) that excel in visual understanding across single images, multiple images, and videos. It targets researchers and developers seeking state-of-the-art performance in multimodal AI, offering capabilities that rival commercial models on numerous benchmarks.
How It Works
LLaVA-NeXT builds upon the LLaVA architecture, integrating stronger Large Language Models (LLMs) like Llama-3 and Qwen-1.5. It employs visual instruction tuning, processing interleaved image-text data to unify diverse tasks including multi-image, video, and 3D understanding. This approach enables strong zero-shot modality transfer and competitive performance on video benchmarks.
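As a loose illustration of what interleaved image-text training data looks like, a multi-image instruction sample might resemble the record below. The field names follow the LLaVA-style conversation schema; treat the exact keys and conventions as an assumption for illustration, not the project's canonical specification.

# Illustrative sketch of an interleaved multi-image instruction-tuning record.
# Field names mirror the LLaVA-style "conversations" format; exact keys are
# assumptions for illustration, not the project's canonical spec.
sample = {
    "id": "demo-0001",
    "image": ["chart_2023.png", "chart_2024.png"],  # one entry per <image> placeholder below
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nCompare the two charts and summarize the trend.",
        },
        {
            "from": "gpt",
            "value": "Both charts show revenue; the 2024 chart indicates roughly 15% growth.",
        },
    ],
}

Unifying single-image, multi-image, and video inputs under one interleaved format is what allows a single model to transfer across these modalities without task-specific heads.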
Quick Start & Requirements
Install with pip install -e ".[train]". A conda environment is recommended (conda create -n llava python=3.10). SGLang is supported for accelerated inference and deployment.
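For a quick sanity check after installation, inference can also be run through the Hugging Face-ported checkpoints instead of the repo's own scripts. The minimal sketch below assumes the llava-hf/llava-v1.6-mistral-7b-hf checkpoint, the transformers LlavaNextProcessor and LlavaNextForConditionalGeneration classes, and a placeholder image URL; adapt the prompt template to the checkpoint you actually use.

# Minimal inference sketch using a Hugging Face-ported LLaVA-NeXT checkpoint.
# Assumes transformers, torch, pillow, and requests are installed; the checkpoint
# name and prompt template follow the llava-hf ports, not this repo's CLI.
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Replace with any local or remote image; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "[INST] <image>\nDescribe this image in one sentence. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))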
Highlighted Details
Maintenance & Community
The project is actively maintained by a team including Bo Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Renrui Zhang, and Yuanhan Zhang, led by Chunyuan Li, with contributions from Haotian Liu. The companion lmms-eval evaluation framework is also supported by key contributors.
Licensing & Compatibility
Uses original licenses of datasets and base language models (e.g., Llama-1/2 community license, Tongyi Qianwen RESEARCH LICENSE AGREEMENT, Llama-3 Research License). Users must comply with these terms, including OpenAI's Terms of Use. No additional constraints are imposed by the project itself.
Limitations & Caveats
The project relies on base models with specific licenses that may restrict commercial use. Users must ensure compliance with all applicable laws and the terms of the underlying model licenses.