OmniVinci by NVlabs

Omni-modal LLM for joint perception and reasoning

Created 2 months ago
585 stars

Top 55.4% on SourcePulse

View on GitHub
Project Summary

OmniVinci: Omni-Modal LLM for Joint Understanding

OmniVinci is an open-source, omni-modal Large Language Model (LLM) designed for joint understanding across vision, audio, and language. It addresses the need for AI systems that perceive and reason across multiple sensory inputs, offering enhanced performance with significantly reduced training data compared to existing models. The project targets researchers and developers working on multimodal AI applications in fields like robotics, medical AI, and smart factories.

How It Works

OmniVinci introduces three key architectural innovations: OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared latent space, Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals, and Constrained Rotary Time Embedding for encoding absolute temporal information. These are supported by a data curation and synthesis pipeline generating 24 million single-modal and omni-modal conversations. This approach allows modalities to mutually reinforce perception and reasoning, leading to improved performance and efficiency.
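
The README does not spell out these mechanisms in code, so the sketch below is only a loose illustration of the shared-latent-space alignment idea, not NVlabs' actual OmniAlignNet: it projects paired vision and audio clip embeddings into one latent space and trains them with a symmetric CLIP-style contrastive loss. The module name, embedding dimensions, and loss choice are all assumptions.

```python
# Hypothetical sketch of cross-modal alignment in a shared latent space.
# The real OmniAlignNet architecture is defined in the OmniVinci paper, not here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAlignNet(nn.Module):
    """Projects vision and audio embeddings into one latent space and
    trains them with a symmetric InfoNCE-style contrastive loss."""

    def __init__(self, vision_dim=1024, audio_dim=768, latent_dim=512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style temperature

    def forward(self, vision_emb, audio_emb):
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()            # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)     # matched pairs lie on the diagonal
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return loss

# Toy usage: a batch of 4 paired vision/audio clip embeddings.
vision = torch.randn(4, 1024)
audio = torch.randn(4, 768)
print(SimpleAlignNet()(vision, audio))
```

Temporal Embedding Grouping and Constrained Rotary Time Embedding additionally inject relative and absolute timing information into these embeddings; their exact formulations are given in the paper.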

Quick Start & Requirements

  • Installation: Download the model from Hugging Face using huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False. Set up the Python environment using bash ./environment_setup.sh (based on NVILA codebase).
  • Prerequisites: A Python environment setup script is provided. Inference requires the transformers library. The example loads the model with torch_dtype="torch.float16" and device_map="auto", so GPU acceleration is recommended.
  • Usage: Inference examples for video (with audio), audio, and image are available in the repository. The provided video inference example demonstrates loading a model and processor from a local path and generating text from video input (a minimal loading sketch follows this list).
  • Links: Hugging Face repo: https://huggingface.co/nvidia/omnivinci
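
For orientation, here is a rough sketch of the loading step implied by the bullets above (local path ./omnivinci, trust_remote_code, float16, device_map="auto"). The exact preprocessing and generation calls for video-plus-audio input are part of the model's custom code, so defer to the repository's inference examples for the full pipeline.

```python
# Rough sketch of loading the checkpoint downloaded to ./omnivinci, based on the
# quick-start notes above. Generation for video+audio input is model-specific;
# see the repository's inference examples for the complete workflow.
import torch
from transformers import AutoModel, AutoProcessor

model_path = "./omnivinci"  # local dir created by huggingface-cli download

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,    # required by the model; review the remote code before running
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
```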

Highlighted Details

  • Outperforms Qwen2.5-Omni by +19.05 on DailyOmni, +1.7 on MMAR, and +3.9 on Video-MME.
  • Achieves superior performance using only 0.2T training tokens, a 6x reduction compared to Qwen2.5-Omni's 1.2T.
  • Demonstrates omni-modal advantages in downstream applications including robotics, medical AI, and smart factory.
  • The model supports joint understanding of vision, audio, and text.

Maintenance & Community

The README does not provide specific details on community channels (e.g., Discord, Slack), roadmap, or notable sponsorships. The project is presented as an initiative from NVIDIA.

Licensing & Compatibility

The README does not explicitly state a software license; it references only the accompanying arXiv paper, and a research citation by itself does not establish usage terms. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The inference code requires trust_remote_code=True, necessitating careful security review. The project is presented as a recent release ("OmniVinci-9B is released!"), and detailed limitations or known issues are not explicitly listed in the provided README excerpt.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 217 stars in the last 30 days
