Unified discrete autoregressive model for image and language generation
X-Omni provides official inference code and the LongText-Bench benchmark for a unified discrete autoregressive model that generates images from text prompts in both English and Chinese. It targets researchers and practitioners working on multimodal generative AI, emphasizing faithful instruction following and accurate text rendering in generated images.
How It Works
X-Omni employs a discrete autoregressive modeling approach that unifies image and language generation within a single framework: images are represented as sequences of discrete tokens, so one next-token predictor covers both modalities. This design allows precise control over text rendering within images and supports arbitrary output resolutions. The model is further refined with reinforcement learning, which improves its handling of complex instructions and the aesthetic quality of its outputs.
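To make the mechanism concrete, here is a minimal, self-contained toy sketch of discrete autoregressive generation over a shared text-and-image vocabulary. It is not X-Omni's implementation: the vocabulary sizes, model dimensions, and the absence of a learned image tokenizer are all simplifying assumptions.

```python
# Toy sketch of unified discrete autoregressive generation. NOT the official
# X-Omni code: all names and sizes below are illustrative assumptions; a real
# system pairs a learned image tokenizer with a large pretrained transformer.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000                 # assumed toy text vocabulary size
IMAGE_VOCAB = 512                 # assumed toy image codebook size
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # shared vocabulary: image codes sit after text ids


class ToyUnifiedAR(nn.Module):
    """One decoder-only transformer over a shared text+image token vocabulary."""

    def __init__(self, d_model: int = 128, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.lm_head(self.blocks(x, mask=mask))


@torch.no_grad()
def generate_image_tokens(model, prompt_ids, n_image_tokens=16, temperature=1.0):
    """Autoregressively sample image-codebook tokens after a text prompt."""
    ids = prompt_ids
    for _ in range(n_image_tokens):
        logits = model(ids)[:, -1] / temperature
        # Restrict sampling to the image slice of the shared vocabulary.
        logits[:, :TEXT_VOCAB] = float("-inf")
        next_id = torch.multinomial(logits.softmax(-1), 1)
        ids = torch.cat([ids, next_id], dim=1)
    # Codebook-relative codes; a tokenizer decoder would map them to pixels.
    return ids[:, prompt_ids.size(1):] - TEXT_VOCAB


model = ToyUnifiedAR()
prompt = torch.randint(0, TEXT_VOCAB, (1, 8))  # pretend-tokenized text prompt
print(generate_image_tokens(model, prompt).shape)  # torch.Size([1, 16])
```

The key design point is the shared vocabulary: because image codes live in the same id space as text tokens, a single next-token objective trains both modalities, and masking the text slice of the logits at sampling time is enough to force image output.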
Quick Start & Requirements
Create and activate a conda environment (`conda create -n xomni python==3.12`, then `conda activate xomni`). Install dependencies via `pip install -r requirements.txt` and `pip install flash-attn --no-build-isolation`. Key pinned dependencies include `flash-attn`, `transformers==4.52.0`, and `qwen_vl_utils`. Evaluation uses a distributed script.
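The README covers environment setup, while the actual inference entry point lives in the repository's own scripts. The snippet below is only a generic Hugging Face loading sketch under stated assumptions: the checkpoint id is a placeholder, and `AutoModelForCausalLM` with `trust_remote_code=True` is the usual route for models that ship custom modeling code, not necessarily X-Omni's official API.

```python
# Hedged sketch of a generic Hugging Face loading pattern, NOT X-Omni's
# official inference code. The checkpoint id below is a placeholder
# assumption; use the repository's scripts for real inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "X-Omni/placeholder-checkpoint"  # assumption: replace with the real HF repo id

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,                # flash-attn generally expects bf16/fp16
    attn_implementation="flash_attention_2",   # enabled by the flash-attn install above
    trust_remote_code=True,                    # custom unified text+image modeling code
    device_map="auto",
)

inputs = tokenizer("A street sign that reads 'LongText-Bench'", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For evaluation, distributed scripts of this kind are commonly launched with torchrun across multiple GPUs; consult the repository's evaluation script for the exact invocation.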
Maintenance & Community
The project is associated with the Tencent Hunyuan X Team. Contact information for Yibing Wang and Xiaosong Zhang is provided for inquiries and collaboration.
Licensing & Compatibility
The repository does not explicitly state a license. The model weights are available on Hugging Face. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not specify any limitations or known issues. The project appears to be recent, with a paper published in 2025.