Open-source VLMs for efficient video/multi-image understanding
VILA is a family of state-of-the-art Vision-Language Models (VLMs) built for efficient multimodal tasks such as image and video understanding. It targets researchers and developers who need high-performance, deployable VLMs that balance accuracy and efficiency across edge, data-center, and cloud environments.
How It Works
VILA employs an interleaved image-text pre-training approach, enabling multi-image reasoning and robust in-context learning. Recent versions like NVILA (VILA 2.0) focus on full-stack efficiency, from cheaper training to faster deployment and improved performance. LongVILA extends capabilities to over 1 million tokens for long-context video understanding.
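To make the interleaved image-text idea concrete, the sketch below assembles a multi-image prompt in a generic message format. The message schema, the model wrapper, and the generate() call are illustrative assumptions, not VILA's actual inference API; consult the repository for the real interface.

```python
# Illustrative sketch of an interleaved image-text prompt for multi-image
# reasoning, the data layout VILA-style pre-training targets.
# The message schema and generate() call are hypothetical stand-ins.

interleaved_prompt = [
    {"type": "image", "value": "frames/frame_001.jpg"},
    {"type": "text",  "value": "Here is the first frame of the clip."},
    {"type": "image", "value": "frames/frame_120.jpg"},
    {"type": "text",  "value": "Here is a later frame. Describe what changed between them."},
]

def run_interleaved_query(model, prompt, max_new_tokens=128):
    """Send the interleaved sequence to a (hypothetical) VLM wrapper and
    return the generated answer text."""
    return model.generate(prompt, max_new_tokens=max_new_tokens)
```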
Quick Start & Requirements
Set up the environment with the provided script (./environment_setup.sh). Optional: one-logger-utils for NVIDIA employees.
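Once the environment is set up, a checkpoint can be fetched for local evaluation. The snippet below is a minimal sketch that assumes the NVILA-8B weights are hosted under the Efficient-Large-Model organization on Hugging Face; check the project README for the authoritative model list.

```python
# Minimal sketch: download a VILA-family checkpoint for local evaluation.
# The repo_id below is an assumption about where the weights are hosted.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Efficient-Large-Model/NVILA-8B")
print(f"Checkpoint downloaded to: {local_dir}")
```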
Highlighted Details
Maintenance & Community
The project is actively developed by NVlabs, with significant contributions from researchers at NVIDIA and MIT. Recent updates include NVILA (VILA 2.0), LongVILA, and VILA-M3. The project is associated with the Cosmos Nemotron family.
Licensing & Compatibility
Limitations & Caveats
The provided API server is intended for evaluation only and is not optimized for production use. The CC-BY-NC-SA-4.0 license on the pretrained weights restricts commercial use.