OneCAT by onecat-ai

Unified multimodal AI for understanding, generation, and editing

Created 4 months ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
Project Summary

Summary

OneCAT is a unified multimodal model addressing the need for integrated understanding, generation, and editing within a single, efficient decoder-only transformer architecture. It targets researchers and engineers working with multimodal AI, offering significant performance gains and reduced computational overhead by eliminating external vision components during inference.

How It Works

The core innovation is a pure decoder-only transformer that eschews external vision encoders (like ViT) or VAE tokenizers at inference time, relying instead on a lightweight patch embedding. It employs a modality-specific Mixture-of-Experts (MoE) structure with specialized FFNs for text, visual understanding, and image synthesis. A novel multi-scale autoregressive mechanism enables coarse-to-fine image generation, drastically cutting decoding steps compared to diffusion models.
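
A minimal PyTorch sketch of the modality-specific MoE idea may help make this concrete. It shows one decoder block whose self-attention is shared while each token is hard-routed to one of three FFN experts by a modality id; every name here (ModalityMoEBlock, the TEXT/UNDERSTANDING/GENERATION ids, the layer sizes) is an illustrative assumption, not OneCAT's actual code:

    import torch
    import torch.nn as nn

    TEXT, UNDERSTANDING, GENERATION = 0, 1, 2  # hypothetical modality ids

    class ModalityMoEBlock(nn.Module):
        """One decoder block: shared self-attention, then per-token hard
        routing to one of three modality-specific FFN experts."""
        def __init__(self, dim=1024, hidden=4096, n_heads=16):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(3)
            )

        def forward(self, x, modality_ids):
            # x: (B, S, dim); modality_ids: (B, S) with values in {0, 1, 2}
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            h = self.norm2(x)
            out = torch.zeros_like(h)
            for m, expert in enumerate(self.experts):
                mask = modality_ids == m       # tokens assigned to expert m
                if mask.any():
                    out[mask] = expert(h[mask])
            return x + out

A causal attention mask is omitted for brevity; an actual decoder-only block would also pass an attn_mask to the attention call.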

Quick Start & Requirements

Installation requires 64-bit Python 3.11.8 and PyTorch 2.5.1. Install dependencies via pip3 install -r requirements.txt. Users must download the OneCAT-3B model weights and the infinity_vae_d32reg.pth tokenizer. Example commands, launched via accelerate launch, are provided for visual understanding (generate_understanding.py), text-to-image generation (generate_txt2img.py), and image editing (generate_imgedit.py). Further details are available in the OneCAT Technical Report.

Highlighted Details

  • Pure Decoder-Only: Eliminates reliance on external vision encoders or VAEs during inference, simplifying the architecture.
  • Mixture-of-Experts (MoE): Integrates three specialized FFN experts for distinct multimodal tasks.
  • Multi-Scale Autoregressive Generation: Employs a "Next Scale Prediction" paradigm for efficient, coarse-to-fine image synthesis (sketched after this list).
  • State-of-the-Art Performance: Claims to outperform existing open-source unified multimodal models across various benchmarks.
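
As referenced in the list above, here is a minimal sketch of the coarse-to-fine decoding loop implied by Next Scale Prediction. The model object and its predict_scale interface are hypothetical stand-ins for whatever OneCAT actually exposes; the point is only that each decoding step emits an entire token map for one scale, conditioned on all coarser scales, rather than one token at a time:

    import torch

    def generate_coarse_to_fine(model, prompt_tokens, scales=(1, 2, 4, 8, 16)):
        # Autoregress over scales, not individual tokens: one step per scale
        # predicts the full s x s token map, so five scales mean five steps.
        context = [prompt_tokens]                    # (B, L) conditioning ids
        token_maps = []
        for s in scales:
            seq = torch.cat(context, dim=1)          # all coarser scales as context
            tok = model.predict_scale(seq, size=s)   # hypothetical: (B, s, s) ids
            token_maps.append(tok)
            context.append(tok.flatten(1))           # feed this map to finer scales
        return token_maps                            # pixel decoding is out of scope

This is why the step count stays fixed at the number of scales, in contrast to diffusion samplers that iterate many denoising steps over the full resolution.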

Maintenance & Community

Direct contact is available via email (wangyaoming03@meituan.com) or by opening GitHub issues. No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the README.

Licensing & Compatibility

The project is licensed under the permissive Apache 2.0 license, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail any limitations, alpha status, known bugs, or unsupported platforms.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

Top 0.1% on SourcePulse · 2k stars
Suite of neural tokenizers for image and video processing
Created 1 year ago · Updated 11 months ago