VILA by NVlabs

Open-source VLMs for efficient video/multi-image understanding

Created 1 year ago
3,460 stars
Top 14.2% on sourcepulse

View on GitHub
Project Summary

VILA is a family of state-of-the-art Vision-Language Models (VLMs) designed for efficient multimodal tasks such as image and video understanding. It targets researchers and developers who need high-performance, deployable VLMs across edge, data-center, and cloud environments, optimized for both accuracy and efficiency.

How It Works

VILA is pre-trained on interleaved image-text data rather than isolated image-caption pairs, which enables multi-image reasoning and robust in-context learning. Recent versions such as NVILA (VILA 2.0) pursue full-stack efficiency, reducing training cost and speeding up deployment while improving accuracy. LongVILA extends the context window beyond 1 million tokens for long-context video understanding.
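
As a rough illustration, an interleaved sample keeps images and text in their original document order instead of reducing them to isolated image-caption pairs. The Python sketch below mocks up that structure; the field names and schema are illustrative assumptions, not the repo's actual data format.

    # Mock-up of one interleaved image-text sample. The schema is
    # illustrative; VILA's data pipeline defines its own format.
    sample = [
        {"type": "image", "path": "report_fig1.png"},
        {"type": "text", "value": "Figure 1 shows latency versus batch size."},
        {"type": "image", "path": "report_fig2.png"},
        {"type": "text", "value": "Figure 2 repeats the run on an A100."},
    ]
    # Because images and text stay interleaved, the model sees
    # cross-image references during pre-training, which is what
    # enables multi-image reasoning and in-context learning.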

Quick Start & Requirements

  • Installation: Clone the repository and run ./environment_setup.sh.
  • Prerequisites: Anaconda, Python packages (specified in environment_setup.sh). Optional: one-logger-utils for NVIDIA employees.
  • Deployment: Supports AWQ-quantized 4-bit models via TinyChat for NVIDIA GPUs (A100, 4090, Orin) and TinyChatEngine for CPU (x86, ARM).
  • Resources: Training requires multi-GPU setups (e.g., 8xA100 nodes). Inference performance benchmarks are provided for various NVIDIA GPUs.
  • Links: arXiv, Demo, Models
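
After setup, a first inference call might look like the sketch below. This is a hypothetical example: the llava.load and generate_content entry points, the Image media helper, and the checkpoint name are assumptions modeled loosely on the repo's Python package, so consult the repository's inference docs for the real API.

    # Hypothetical quick-start inference; entry points and checkpoint
    # name are unverified assumptions.
    import llava
    from llava.media import Image

    model = llava.load("Efficient-Large-Model/NVILA-8B")  # example checkpoint
    response = model.generate_content([
        Image("demo.png"),
        "Describe this image in one sentence.",
    ])
    print(response)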

Highlighted Details

  • Achieves state-of-the-art results among open-source models on the MMMU and Video-MME leaderboards.
  • Supports long-context video understanding with models handling 1M+ tokens.
  • Provides AWQ-quantized 4-bit models for efficient deployment on diverse NVIDIA GPUs and CPUs.
  • Includes training, evaluation, and inference scripts, along with an API server for research purposes.
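
If the API server follows the OpenAI chat-completions convention, a common choice for research servers but an assumption here, a minimal client could look like this (base URL, port, and model name are placeholders):

    # Illustrative client, assuming the research API server exposes an
    # OpenAI-compatible chat-completions endpoint. Base URL, API key,
    # and model name are placeholders, not documented values.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000", api_key="unused")
    response = client.chat.completions.create(
        model="NVILA-15B",  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)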

Maintenance & Community

The project is actively developed by NVlabs, with significant contributions from researchers at NVIDIA and MIT. Recent updates include NVILA (VILA 2.0), LongVILA, and VILA-M3. The project is associated with the Cosmos Nemotron family.

Licensing & Compatibility

  • Code: Apache 2.0 license.
  • Pretrained Weights: CC-BY-NC-SA-4.0 license.
  • LLaMA3-VILA Checkpoints: Subject to the LLaMA 3 license.
  • Compatibility: Primarily intended for non-commercial research use due to weight licenses.

Limitations & Caveats

The provided API server is intended for evaluation and is not optimized for production use. The CC-BY-NC-SA-4.0 license on the pretrained weights restricts commercial use.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 278 stars in the last 90 days
