Oryx by Oryx-mllm

MLLM research paper for on-demand spatial-temporal understanding

created 10 months ago
318 stars

Top 86.3% on sourcepulse

Project Summary

Oryx is a unified multimodal large language model (MLLM) designed for on-demand spatial-temporal understanding across images, videos, and 3D scenes. It enables seamless processing of visual inputs with arbitrary spatial and temporal resolutions, targeting researchers and developers working with complex visual data.

How It Works

Oryx employs a novel "on-demand visual perception" approach, featuring a dynamic compressor and native resolution perception. This architecture allows it to adaptively handle varying input resolutions and temporal lengths without fixed-size constraints, offering efficient and flexible visual understanding. The model integrates a custom Oryx-ViT visual encoder with various LLM backbones like Qwen and Yi.
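As a rough illustration of the idea described above, the sketch below estimates how many visual tokens an input produces when it is kept at its native resolution and then compressed more aggressively as it gets longer. Everything concrete here is an assumption made for illustration and is not taken from the Oryx codebase: the patch size of 14, the input categories, the per-side downsampling strides, and the names `VisualInput`, `patch_grid`, and `total_tokens` are all hypothetical.

```python
from dataclasses import dataclass

# Assumed values for illustration only; not taken from the Oryx codebase.
PATCH_SIZE = 14
# Per-side downsampling stride applied to the patch grid for each input kind.
STRIDE_BY_KIND = {"image": 1, "short_video": 2, "long_video": 4}


@dataclass(frozen=True)
class VisualInput:
    kind: str        # one of the keys in STRIDE_BY_KIND
    height: int      # pixels
    width: int       # pixels
    frames: int = 1  # 1 for a still image


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Patch rows and columns at native resolution (no resize to a fixed square)."""
    return ceil_div(height, patch_size), ceil_div(width, patch_size)


def total_tokens(item: VisualInput) -> int:
    """Visual tokens after on-demand compression, summed over all frames."""
    rows, cols = patch_grid(item.height, item.width)
    stride = STRIDE_BY_KIND[item.kind]
    per_frame = ceil_div(rows, stride) * ceil_div(cols, stride)
    return per_frame * item.frames


if __name__ == "__main__":
    image = VisualInput("image", height=448, width=336)
    clip = VisualInput("long_video", height=448, width=336, frames=100)
    print(total_tokens(image))  # 768
    print(total_tokens(clip))   # 4800
```

Under these assumed settings, a single 448×336 image keeps its full 32×24 patch grid, while each frame of a 100-frame clip is reduced to an 8×6 grid, so the long clip costs about six times the tokens of one image rather than a hundred times.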

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using conda and pip.
  • Prerequisites: Python 3.10, PyTorch. Model checkpoints and vision encoder weights need to be downloaded from Hugging Face.
  • Resources: Inference requires the downloaded model weights (7B or 34B, or a 1.5 series model) together with the vision encoder. Training additionally requires custom data preparation.
  • Links: Project Page, arXiv Paper, Demo, Checkpoints, Data.
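Before downloading any weights, it may help to confirm that the local environment matches the prerequisites listed above. The following is a minimal sketch using only the Python standard library; the function name `check_environment` is hypothetical, and the exact-match check against Python 3.10 assumes the summary means that specific version rather than a minimum.

```python
import importlib.util
import sys

# Taken from the prerequisites above; treated here as an exact version (assumption).
REQUIRED_PYTHON = (3, 10)


def check_environment() -> dict:
    """Report whether the interpreter and PyTorch match the stated prerequisites."""
    return {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_matches": sys.version_info[:2] == REQUIRED_PYTHON,
        # find_spec returns None when the top-level package cannot be located.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

The script only inspects the environment and does not install or download anything; the actual installation steps and checkpoint locations are the ones given in the repository and on Hugging Face.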

Highlighted Details

  • Achieves state-of-the-art performance on image, video, and 3D benchmarks, surpassing commercial models on some tasks.
  • Ranks 1st on MLVU leaderboard, outperforming GPT-4o.
  • Offers 7B and 34B parameter variants, with newer 1.5 series models using Qwen-2.5.
  • Codebase is built on LLaVA.

Maintenance & Community

The project is actively developed by researchers from Tsinghua University and Tencent. News updates are frequent, with recent releases of 1.5 series models and training data.

Licensing & Compatibility

The repository does not explicitly state a license. The codebase is based on LLaVA, which is released under the Apache 2.0 license, but the terms that apply to Oryx itself should be verified before use.

Limitations & Caveats

The project is presented as an ICLR 2025 submission, indicating it is a research artifact. While checkpoints and demos are available, comprehensive documentation for all features or extensive community support may still be evolving. The "TODO List" indicates ongoing development.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 0

Star History

16 stars in the last 90 days
