MLLM research paper for on-demand spatial-temporal understanding
Top 86.3% on sourcepulse
Oryx is a unified multimodal large language model (MLLM) designed for on-demand spatial-temporal understanding across images, videos, and 3D scenes. It enables seamless processing of visual inputs with arbitrary spatial and temporal resolutions, targeting researchers and developers working with complex visual data.
How It Works
Oryx employs a novel "on-demand visual perception" approach, featuring a dynamic compressor and native resolution perception. This architecture allows it to adaptively handle varying input resolutions and temporal lengths without fixed-size constraints, offering efficient and flexible visual understanding. The model integrates a custom Oryx-ViT visual encoder with various LLM backbones like Qwen and Yi.
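The repository's actual compressor implementation differs in detail; the sketch below is only a conceptual PyTorch illustration of "on-demand" compression, where the pooling schedule, tensor shapes, and the function name dynamic_compress are illustrative assumptions rather than the project's API.

```python
import torch
import torch.nn.functional as F

def dynamic_compress(visual_tokens: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Toy dynamic compressor: longer inputs get a higher downsampling ratio.

    visual_tokens: (num_frames, H, W, C) patch features from a visual encoder
    operating at native resolution. Returns a flattened token sequence.
    """
    # Illustrative schedule (assumption): single images keep full detail,
    # short clips are pooled 2x per side, long videos are pooled 4x per side.
    if num_frames <= 1:
        ratio = 1
    elif num_frames <= 64:
        ratio = 2
    else:
        ratio = 4

    f, h, w, c = visual_tokens.shape
    # (F, H, W, C) -> (F, C, H, W) so spatial pooling can be applied.
    x = visual_tokens.permute(0, 3, 1, 2)
    if ratio > 1:
        x = F.avg_pool2d(x, kernel_size=ratio)
    # Flatten back to a sequence of visual tokens: (F * H/ratio * W/ratio, C).
    return x.flatten(2).transpose(1, 2).reshape(-1, c)

# Example: a 128-frame video with a non-square native-resolution patch grid.
video = torch.randn(128, 24, 42, 1024)
print(dynamic_compress(video, num_frames=128).shape)  # far fewer tokens than the raw grid
```

The point of the sketch is that the token budget adapts to the input rather than being fixed, which is what lets a single model handle images, long videos, and 3D data within one context window.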
Quick Start & Requirements
Installation is managed with conda and pip.
Highlighted Details
Maintenance & Community
The project is actively developed by researchers from Tsinghua University and Tencent. News updates are frequent, with recent releases of the 1.5-series models and training data.
Licensing & Compatibility
The repository does not explicitly state a license. The codebase is based on LLaVA, which is typically under an Apache 2.0 license, but this should be verified for Oryx specifically.
Limitations & Caveats
The project is presented as an ICLR 2025 submission, indicating it is a research artifact. While checkpoints and demos are available, comprehensive documentation for all features or extensive community support may still be evolving. The "TODO List" indicates ongoing development.