X-Omni  by X-Omni-Team

Unified discrete autoregressive model for image and language generation

Created 2 months ago
379 stars

Top 75.1% on SourcePulse

Project Summary

X-Omni provides official inference code and the LongText-Bench benchmark for a unified discrete autoregressive model that generates images from English and Chinese text prompts. It targets researchers and practitioners working on multimodal generative AI, with an emphasis on instruction following and on rendering text accurately inside generated images.

How It Works

X-Omni employs a discrete autoregressive modeling approach, unifying image and language generation within a single framework. This method allows for precise control over text rendering within images and supports arbitrary output resolutions. The model leverages reinforcement learning to enhance its performance, particularly in handling complex instructions and generating aesthetically pleasing outputs.
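
As a rough illustration of what "discrete autoregressive" generation over a shared vocabulary means, the toy sketch below samples image tokens one at a time, each conditioned on the tokens before it. It is not X-Omni's code: the vocabulary sizes, the `toy_next_token_weights` stand-in for a trained model, and the function names are all hypothetical, chosen only to make the loop runnable.

```python
import random

# Hypothetical shared vocabulary: ids 0..7 are text tokens, ids 8..23 are image tokens.
TEXT_VOCAB = 8
IMAGE_VOCAB = 16
VOCAB_SIZE = TEXT_VOCAB + IMAGE_VOCAB


def toy_next_token_weights(prefix):
    """Stand-in for a trained model: one positive weight per token id.

    The weights here depend only on the last token; a real model would
    condition on the whole prefix.
    """
    last = prefix[-1] if prefix else 0
    return [1.0 + ((last * 7 + tok) % 5) for tok in range(VOCAB_SIZE)]


def generate_image_tokens(prompt_tokens, num_image_tokens, seed=0):
    """Autoregressively sample image tokens after a text prompt."""
    rng = random.Random(seed)
    sequence = list(prompt_tokens)
    for _ in range(num_image_tokens):
        weights = toy_next_token_weights(sequence)
        # Keep only the image part of the vocabulary while filling the image span.
        image_weights = weights[TEXT_VOCAB:]
        offset = rng.choices(range(IMAGE_VOCAB), weights=image_weights, k=1)[0]
        sequence.append(TEXT_VOCAB + offset)
    # Return only the newly generated image tokens.
    return sequence[len(prompt_tokens):]


if __name__ == "__main__":
    print(generate_image_tokens([1, 2, 3], num_image_tokens=12))
```

In a real system the sampled image tokens would then be passed to a separate decoder to produce pixels; that step is outside this sketch.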

Quick Start & Requirements

  • Installation: Requires Python 3.12 and uses Conda for environment management (conda create -n xomni python==3.12, conda activate xomni). Install dependencies via pip install -r requirements.txt and pip install flash-attn --no-build-isolation.
  • Prerequisites: CUDA 12 is recommended for flash-attn.
  • Inference: Examples provided for English and Chinese image generation, and multi-modal chat. Requires downloading FLUX.1-dev model weights.
  • LongText-Bench: Requires transformers==4.52.0 and qwen_vl_utils. Evaluation uses a distributed script.
  • Links: Project Page, Paper, Model, Space, LongText-Bench.
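
The requirements above can be checked before running anything with a small script along these lines. This is a sketch, not part of the repository: the `check_environment` helper and the report format are hypothetical, and it assumes the import names `flash_attn`, `transformers`, and `qwen_vl_utils` for the packages listed. The `transformers` pin applies to LongText-Bench only, so a mismatch there does not necessarily affect plain inference.

```python
import importlib.util
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 12)

# Import name -> pinned version (None when the summary gives no pin).
PACKAGES = {
    "flash_attn": None,        # installed with `pip install flash-attn`
    "transformers": "4.52.0",  # pinned for LongText-Bench
    "qwen_vl_utils": None,     # needed for LongText-Bench
}


def check_environment():
    """Return a dict mapping each requirement to True (satisfied) or False."""
    report = {"python_ok": sys.version_info[:2] == REQUIRED_PYTHON}
    for module_name, pinned in PACKAGES.items():
        ok = importlib.util.find_spec(module_name) is not None
        if ok and pinned is not None:
            try:
                # For transformers the distribution name matches the import name.
                ok = metadata.version(module_name) == pinned
            except metadata.PackageNotFoundError:
                ok = False
        report[module_name] = ok
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or wrong version'}")
```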

Highlighted Details

  • Unified discrete autoregressive model for image and language.
  • Emphasis on instruction following and text rendering (English/Chinese).
  • Generates images at arbitrary resolutions.
  • Includes the LongText-Bench benchmark for evaluation.

Maintenance & Community

The project is associated with the Tencent Hunyuan X Team. Contact information for Yibing Wang and Xiaosong Zhang is provided for inquiries and collaboration.

Licensing & Compatibility

The repository does not explicitly state a license. The model weights are available on Hugging Face. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify any limitations or known issues. The project appears to be recent, with a paper published in 2025.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
