OneCAT by onecat-ai

Unified multimodal AI for understanding, generation, and editing

Created 4 months ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
Project Summary

Summary

OneCAT is a unified multimodal model addressing the need for integrated understanding, generation, and editing within a single, efficient decoder-only transformer architecture. It targets researchers and engineers working with multimodal AI, offering significant performance gains and reduced computational overhead by eliminating external vision components during inference.

How It Works

The core innovation is a pure decoder-only transformer that eschews external vision encoders (like ViT) or VAE tokenizers at inference time, relying instead on a lightweight patch embedding. It employs a modality-specific Mixture-of-Experts (MoE) structure with specialized FFNs for text, visual understanding, and image synthesis. A novel multi-scale autoregressive mechanism enables coarse-to-fine image generation, drastically cutting decoding steps compared to diffusion models.
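
A minimal PyTorch sketch of the modality-specific MoE idea may help make this concrete. It shows one decoder block whose self-attention is shared while each token is hard-routed to one of three FFN experts by a modality id; every name here (ModalityMoEBlock, the TEXT/UNDERSTANDING/GENERATION ids, the layer sizes) is an illustrative assumption, not OneCAT's actual code:

    import torch
    import torch.nn as nn

    TEXT, UNDERSTANDING, GENERATION = 0, 1, 2  # hypothetical modality ids

    class ModalityMoEBlock(nn.Module):
        """One decoder block: shared self-attention, then per-token hard
        routing to one of three modality-specific FFN experts."""
        def __init__(self, dim=1024, hidden=4096, n_heads=16):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(3)
            )

        def forward(self, x, modality_ids):
            # x: (B, S, dim); modality_ids: (B, S) with values in {0, 1, 2}
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            h = self.norm2(x)
            out = torch.zeros_like(h)
            for m, expert in enumerate(self.experts):
                mask = modality_ids == m       # tokens assigned to expert m
                if mask.any():
                    out[mask] = expert(h[mask])
            return x + out

A causal attention mask is omitted for brevity; an actual decoder-only block would also pass an attn_mask to the attention call.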

Quick Start & Requirements

Installation requires 64-bit Python 3.11.8 and PyTorch 2.5.1. Install dependencies via pip3 install -r requirements.txt. Users must download the OneCAT-3B model weights and the infinity_vae_d32reg.pth tokenizer. Example commands, launched via accelerate launch, are provided for visual understanding (generate_understanding.py), text-to-image generation (generate_txt2img.py), and image editing (generate_imgedit.py). Further details are available in the OneCAT Technical Report.

Highlighted Details

  • Pure Decoder-Only: Eliminates reliance on external vision encoders or VAEs during inference, simplifying the architecture.
  • Mixture-of-Experts (MoE): Integrates three specialized FFN experts for distinct multimodal tasks.
  • Multi-Scale Autoregressive Generation: Employs a "Next Scale Prediction" paradigm for efficient, coarse-to-fine image synthesis (sketched after this list).
  • State-of-the-Art Performance: Claims to outperform existing open-source unified multimodal models across various benchmarks.
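
As referenced in the list above, here is a minimal sketch of the coarse-to-fine decoding loop implied by Next Scale Prediction. The model object and its predict_scale interface are hypothetical stand-ins for whatever OneCAT actually exposes; the point is only that each decoding step emits an entire token map for one scale, conditioned on all coarser scales, rather than one token at a time:

    import torch

    def generate_coarse_to_fine(model, prompt_tokens, scales=(1, 2, 4, 8, 16)):
        # Autoregress over scales, not individual tokens: one step per scale
        # predicts the full s x s token map, so five scales mean five steps.
        context = [prompt_tokens]                    # (B, L) conditioning ids
        token_maps = []
        for s in scales:
            seq = torch.cat(context, dim=1)          # all coarser scales as context
            tok = model.predict_scale(seq, size=s)   # hypothetical: (B, s, s) ids
            token_maps.append(tok)
            context.append(tok.flatten(1))           # feed this map to finer scales
        return token_maps                            # pixel decoding is out of scope

This is why the step count stays fixed at the number of scales, in contrast to diffusion samplers that iterate many denoising steps over the full resolution.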

Maintenance & Community

Direct contact is available via email (wangyaoming03@meituan.com) or by opening GitHub issues. No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the README.

Licensing & Compatibility

The project is licensed under the permissive Apache 2.0 license, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail any limitations, alpha status, known bugs, or unsupported platforms.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

Top 0.1% on SourcePulse · 2k stars
Suite of neural tokenizers for image and video processing
Created 1 year ago · Updated 11 months ago