onecat-ai
Unified multimodal AI for understanding, generation, and editing
Top 98.8% on SourcePulse
Summary
OneCAT is a unified multimodal model that performs understanding, generation, and editing within a single decoder-only transformer. It targets researchers and engineers working with multimodal AI. By eliminating external vision components during inference, it aims to improve performance while reducing computational overhead.
How It Works
The core innovation is a pure decoder-only transformer that uses no external vision encoder (such as a ViT) or VAE tokenizer at inference time, relying instead on a lightweight patch embedding. It employs a modality-specific Mixture-of-Experts (MoE) structure with specialized FFNs for text, visual understanding, and image synthesis. A multi-scale autoregressive mechanism generates images coarse-to-fine, which requires far fewer decoding steps than diffusion models.
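The two ideas above can be illustrated with a toy sketch in plain Python. It is not OneCAT's code: the modality tags, the `ModalityMoE` class, the fixed toy weights, and the scale schedule are all hypothetical. It also assumes that routing is hard (each token goes to exactly one FFN, chosen by its modality tag) and that every token of a given scale is emitted in a single decoding step; the summary does not confirm either detail.

```python
# Hypothetical modality tags; illustrative names, not taken from the OneCAT code.
TEXT, VISUAL_UNDERSTANDING, IMAGE_SYNTHESIS = "text", "und", "gen"


def make_ffn(dim, scale):
    """Build a tiny two-layer ReLU FFN with fixed toy weights (hidden = 2 * dim)."""
    hidden = 2 * dim
    # Up-projection: hidden unit j reads input feature (j % dim), scaled by `scale`.
    w1 = [[scale if i == j % dim else 0.0 for i in range(dim)] for j in range(hidden)]
    # Down-projection: output feature i sums the hidden units that read feature i.
    w2 = [[1.0 if j % dim == i else 0.0 for j in range(hidden)] for i in range(dim)]

    def ffn(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return ffn


class ModalityMoE:
    """Assumed hard routing: each token uses the one FFN matching its modality tag."""

    def __init__(self, dim=4):
        self.experts = {
            TEXT: make_ffn(dim, scale=1.0),
            VISUAL_UNDERSTANDING: make_ffn(dim, scale=2.0),
            IMAGE_SYNTHESIS: make_ffn(dim, scale=0.5),
        }

    def forward(self, tokens, modalities):
        # The expert is selected by the tag, not by a learned gating network.
        return [self.experts[m](x) for x, m in zip(tokens, modalities)]


def decoding_steps(side_lengths):
    """Return (steps with one step per scale, steps with one step per token)."""
    per_scale = len(side_lengths)
    per_token = side_lengths[-1] ** 2  # token-by-token baseline at the final grid size
    return per_scale, per_token


moe = ModalityMoE()
tokens = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
out = moe.forward(tokens, [TEXT, IMAGE_SYNTHESIS])
steps = decoding_steps([1, 2, 4, 8, 16])  # hypothetical coarse-to-fine schedule
print(out, steps)
```

The same input vector produces different outputs depending on its modality tag, which is the point of keeping separate FFNs. With the hypothetical five-scale schedule, coarse-to-fine decoding takes 5 steps where token-by-token decoding of the final 16x16 grid would take 256.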
Quick Start & Requirements
Installation requires 64-bit Python 3.11.8 and PyTorch 2.5.1. Install dependencies via pip3 install -r requirements.txt. Users must download the OneCAT-3B model weights and the infinity_vae_d32reg.pth tokenizer. Example commands are provided for visual understanding (generate_understanding.py), text-to-image generation (generate_txt2img.py), and image editing (generate_imgedit.py) using accelerate launch. Further details are available in the OneCAT Technical Report.
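Before installing, it can help to confirm that the local environment matches the stated requirements. The following is a minimal sketch; the helper names `check_python` and `check_torch` are hypothetical and not part of the repository. It treats the listed versions as exact pins, which is an assumption, since other patch releases may work as well.

```python
import struct
import sys


def check_python(required=(3, 11, 8)):
    """Report whether the running interpreter is 64-bit and matches `required`."""
    is_64bit = struct.calcsize("P") * 8 == 64  # pointer size in bits
    version = tuple(sys.version_info[:3])
    return {"is_64bit": is_64bit, "version": version, "version_ok": version == required}


def check_torch(required="2.5.1"):
    """Report whether PyTorch is importable and matches `required`."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "version": None, "version_ok": False}
    # torch.__version__ may carry a local build suffix such as "+cu121".
    version = str(torch.__version__).split("+")[0]
    return {"installed": True, "version": version, "version_ok": version == required}


if __name__ == "__main__":
    print("python:", check_python())
    print("torch:", check_torch())
```

If either check reports `version_ok` as false, the safest course is to create a fresh virtual environment with the listed versions before running pip3 install -r requirements.txt.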
Highlighted Details
Maintenance & Community
Direct contact is available via email (wangyaoming03@meituan.com) or by opening GitHub issues. No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the README.
Licensing & Compatibility
The project is licensed under the permissive Apache 2.0 license, generally allowing for commercial use and integration into closed-source projects.
Limitations & Caveats
The provided README does not explicitly detail any limitations, alpha status, known bugs, or unsupported platforms.
3 months ago
Inactive