Rex-Omni by IDEA-Research

Multimodal LLM for versatile visual perception via next-point prediction

Created 1 month ago
649 stars

Top 51.4% on SourcePulse

View on GitHub
Project Summary

Rex-Omni is a 3B-parameter Multimodal Large Language Model (MLLM) that reframes diverse visual perception tasks, including object detection, as a next-token prediction problem. It offers a unified framework for researchers and developers seeking advanced visual understanding capabilities, simplifying complex perception tasks through a novel generative approach.

How It Works

The core innovation lies in treating complex vision tasks as a sequence generation problem solvable by an LLM. By predicting the next token, the model can output structured data for tasks like object bounding boxes, keypoints, or segmentation masks, offering a novel, unified approach to visual perception. This generative paradigm allows for flexibility across various downstream applications.
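
In plain autoregressive terms, this means a structured output y (for example, a sequence of quantized box-corner coordinates) is factorized, conditioned on an image I and a text prompt x, as p(y | I, x) = ∏_t p(y_t | y_<t, I, x). The exact coordinate tokenization Rex-Omni uses is not specified here, so this notation is illustrative only.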

Quick Start & Requirements

Installation requires Python 3.10, PyTorch 2.6.0 built against CUDA 12.4, and torchvision 0.21.0. Setup involves creating a Conda environment (conda create -n rexomni python=3.10), installing PyTorch (pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124), cloning the repository (git clone https://github.com/IDEA-Research/Rex-Omni.git), changing into it (cd Rex-Omni), and installing the package in editable mode (pip install -v -e .). A CUDA-capable GPU is required to run the provided examples, e.g. CUDA_VISIBLE_DEVICES=1 python tutorials/detection_example/detection_example.py. Official quick-start examples and a Gradio demo are included in the repository; the commands are consolidated below.
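
The same steps as a single sketch (the environment-activation line is added for convenience, and the exact flags in the official README may differ slightly):

    # Create and activate a Python 3.10 environment
    conda create -n rexomni python=3.10 -y
    conda activate rexomni

    # Install PyTorch 2.6.0 and torchvision 0.21.0 built against CUDA 12.4
    pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124

    # Clone the repository and install it in editable mode
    git clone https://github.com/IDEA-Research/Rex-Omni.git
    cd Rex-Omni
    pip install -v -e .

    # Run the bundled detection example on a chosen GPU
    CUDA_VISIBLE_DEVICES=1 python tutorials/detection_example/detection_example.py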

Highlighted Details

  • Supports a wide array of tasks: object detection, pointing, visual prompting, keypoint detection, OCR (box/polygon), and GUI grounding.
  • Offers two inference backends: transformers for ease of use and vllm for high-throughput, low-latency inference (an optional install note follows this list).
  • Integrates with other models like SAM for enhanced segmentation and can be used for automated data annotation via its Grounding Data Engine.
  • Evaluation code and datasets were released in October 2025, alongside the model's initial release.
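
If you plan to use the vllm backend, it is distributed on PyPI; a minimal install sketch follows (the repository may pin a specific version or provide its own instructions, so check its requirements first):

    # Optional high-throughput backend; any version constraints come from the Rex-Omni repo, not this sketch
    pip install vllm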

Maintenance & Community

The README does not mention community channels (e.g., Discord or Slack), active contributors, or a public roadmap. Its news section indicates recent development activity, with releases in October 2025.

Licensing & Compatibility

The project is released under the "IDEA License 1.0" and is based on Qwen, which is subject to the "Qwen RESEARCH LICENSE AGREEMENT." Both licenses are research-focused and likely impose restrictions on commercial use, requiring careful review before adoption in production environments.

Limitations & Caveats

Fine-tuning capabilities are listed as a future TODO item. The dual research-focused licenses may pose adoption blockers for commercial applications. As of its October 2025 release, the project is relatively new and may still be undergoing active development and refinement.

Health Check

  • Last Commit: 16 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 44

Star History

661 stars in the last 30 days
