LLaVA-NeXT by LLaVA-VL

Multimodal model for image, video, and 3D understanding

created 1 year ago
4,065 stars

Top 12.3% on sourcepulse

View on GitHub
Project Summary

LLaVA-NeXT is an open-source project providing advanced Large Multimodal Models (LMMs) that excel in visual understanding across single images, multiple images, and videos. It targets researchers and developers seeking state-of-the-art performance in multimodal AI, offering capabilities that rival commercial models on numerous benchmarks.

How It Works

LLaVA-NeXT builds upon the LLaVA architecture, integrating stronger Large Language Models (LLMs) like Llama-3 and Qwen-1.5. It employs visual instruction tuning, processing interleaved image-text data to unify diverse tasks including multi-image, video, and 3D understanding. This approach enables strong zero-shot modality transfer and competitive performance on video benchmarks.
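
The snippet below sketches the basic image-question-answer flow. It uses the Hugging Face Transformers port of LLaVA-NeXT rather than this repository's own loader, so the class names, checkpoint ID, and prompt template are assumptions to verify against the model variant you choose; treat it as a minimal illustration, not the project's reference inference path.

    # Minimal single-image Q&A sketch via the Transformers port of LLaVA-NeXT.
    # The checkpoint ID and prompt format below are assumptions; other backbones
    # (e.g. Llama-3 or Qwen-based variants) use different chat templates.
    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("example.jpg")  # any local image
    prompt = "[INST] <image>\nDescribe this image in one sentence. [/INST]"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))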

Quick Start & Requirements

  • Install: Clone the repository and install with pip install -e ".[train]". A conda environment is recommended (conda create -n llava python=3.10); the full sequence is sketched after this list.
  • Prerequisites: Python 3.10+ and PyTorch. Specific model checkpoints and datasets are required for full functionality, and video inference requires additional setup via SGLang.
  • Resources: Training and inference with the larger models (7B, 72B) require significant GPU resources.
  • Docs: LLaVA-OneVision, LLaVA-NeXT-Image, LLaVA-NeXT-Video, LLaVA-NeXT-Interleave.
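
Put together, the install steps above amount to roughly the following shell session. This is a sketch based on the bullet points: the clone URL is inferred from the project name, and the pip upgrade line is a common convention rather than a documented requirement.

    # Clone the repository and set up an isolated environment
    git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
    cd LLaVA-NeXT
    conda create -n llava python=3.10 -y
    conda activate llava
    # Editable install with the training extras
    pip install --upgrade pip
    pip install -e ".[train]"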

Highlighted Details

  • Achieves state-of-the-art performance on numerous single-image, multi-image, and video benchmarks, rivaling top commercial models.
  • Supports a wide range of LLMs, including Llama-3 and Qwen-1.5, for enhanced multimodal capabilities.
  • Offers specialized models and datasets for video understanding (LLaVA-Video-178K) and interleaved image-text tasks.
  • Provides an efficient evaluation pipeline (LMMs-Eval) for faster development of new LMMs; a sample invocation is sketched after this list.
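
As a rough illustration of the LMMs-Eval workflow referenced above, an evaluation run looks something like the command below. The flag names follow the lm-evaluation-harness conventions that lmms-eval builds on, and the model name, checkpoint, and task are placeholders, so consult the lmms-eval documentation for the exact invocation and supported values.

    # Hypothetical lmms-eval run; model, checkpoint, and task are placeholders.
    python -m lmms_eval \
        --model llava \
        --model_args pretrained=liuhaotian/llava-v1.5-7b \
        --tasks mme \
        --batch_size 1 \
        --output_path ./logs/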

Maintenance & Community

The project is actively maintained by a team including Bo Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Renrui Zhang, and Yuanhan Zhang, led by Chunyuan Li, with contributions from Haotian Liu. The companion lmms-eval framework is likewise supported by key contributors.

Licensing & Compatibility

The project inherits the original licenses of its datasets and base language models (e.g., the Llama-1/2 community license, the Tongyi Qianwen Research License Agreement, and the Llama-3 Research License). Users must comply with these terms, including OpenAI's Terms of Use where applicable. The project itself imposes no additional constraints.

Limitations & Caveats

The project relies on base models with specific licenses that may restrict commercial use. Users must ensure compliance with all applicable laws and the terms of the underlying model licenses.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 344 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations
  Open-source framework for training large multimodal models
  Top 0.1% · 4k stars · created 2 years ago · updated 11 months ago

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu
  Multimodal assistant with GPT-4 level capabilities
  Top 0.2% · 23k stars · created 2 years ago · updated 11 months ago