MLLM research paper for on-demand spatial-temporal understanding
Top 86.3% on sourcepulse
Oryx is a unified multimodal large language model (MLLM) designed for on-demand spatial-temporal understanding across images, videos, and 3D scenes. It enables seamless processing of visual inputs with arbitrary spatial and temporal resolutions, targeting researchers and developers working with complex visual data.
How It Works
Oryx employs a novel "on-demand visual perception" approach, featuring a dynamic compressor and native resolution perception. This architecture allows it to adaptively handle varying input resolutions and temporal lengths without fixed-size constraints, offering efficient and flexible visual understanding. The model integrates a custom Oryx-ViT visual encoder with various LLM backbones like Qwen and Yi.
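The repository's actual compressor implementation differs in detail; the sketch below is only a conceptual PyTorch illustration of "on-demand" compression, where the pooling schedule, tensor shapes, and the function name dynamic_compress are illustrative assumptions rather than the project's API.

```python
import torch
import torch.nn.functional as F

def dynamic_compress(visual_tokens: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Toy dynamic compressor: longer inputs get a higher downsampling ratio.

    visual_tokens: (num_frames, H, W, C) patch features from a visual encoder
    operating at native resolution. Returns a flattened token sequence.
    """
    # Illustrative schedule (assumption): single images keep full detail,
    # short clips are pooled 2x per side, long videos are pooled 4x per side.
    if num_frames <= 1:
        ratio = 1
    elif num_frames <= 64:
        ratio = 2
    else:
        ratio = 4

    f, h, w, c = visual_tokens.shape
    # (F, H, W, C) -> (F, C, H, W) so spatial pooling can be applied.
    x = visual_tokens.permute(0, 3, 1, 2)
    if ratio > 1:
        x = F.avg_pool2d(x, kernel_size=ratio)
    # Flatten back to a sequence of visual tokens: (F * H/ratio * W/ratio, C).
    return x.flatten(2).transpose(1, 2).reshape(-1, c)

# Example: a 128-frame video with a non-square native-resolution patch grid.
video = torch.randn(128, 24, 42, 1024)
print(dynamic_compress(video, num_frames=128).shape)  # far fewer tokens than the raw grid
```

The point of the sketch is that the token budget adapts to the input rather than being fixed, which is what lets a single model handle images, long videos, and 3D data within one context window.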
Quick Start & Requirements
Installation is managed with conda and pip.
Highlighted Details
Maintenance & Community
The project is actively developed by researchers from Tsinghua University and Tencent. News updates are frequent, with recent releases of the 1.5-series models and training data.
Licensing & Compatibility
The repository does not explicitly state a license. The codebase is based on LLaVA, which is typically under an Apache 2.0 license, but this should be verified for Oryx specifically.
Limitations & Caveats
The project is presented as an ICLR 2025 submission, indicating it is a research artifact. While checkpoints and demos are available, comprehensive documentation for all features or extensive community support may still be evolving. The "TODO List" indicates ongoing development.