X-Omni by X-Omni-Team

Unified discrete autoregressive model for image and language generation

created 2 weeks ago

336 stars

Top 81.7% on SourcePulse

View on GitHub
Project Summary

X-Omni provides the official inference code and the LongText-Bench benchmark for a unified discrete autoregressive model that generates images from text prompts in both English and Chinese. It is aimed at researchers and practitioners in multimodal generative AI, offering strong instruction following and text rendering in generated images.

How It Works

X-Omni employs a discrete autoregressive modeling approach, unifying image and language generation within a single framework. This method allows for precise control over text rendering within images and supports arbitrary output resolutions. The model leverages reinforcement learning to enhance its performance, particularly in handling complex instructions and generating aesthetically pleasing outputs.
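The unified decoding loop can be sketched as follows. This is a toy illustration only: the vocabulary split, the begin/end-of-image marker tokens, and the uniform sampling are assumptions for exposition, not X-Omni's actual tokenizer or API.

```python
import random

# Hypothetical shared token vocabulary: one id space covers both modalities,
# so a single autoregressive model can emit text and image tokens in sequence.
TEXT_VOCAB = range(0, 1000)      # assumed text-token ids
IMAGE_VOCAB = range(1000, 9000)  # assumed image-token ids
BOI, EOI = 9000, 9001            # assumed begin-/end-of-image marker tokens

def generate(prompt_tokens, n_image_tokens, rng):
    """Autoregressively extend a token sequence; after BOI, emit image tokens.

    A real model would condition each step on the full prefix `seq`;
    here we sample uniformly just to show the discrete decoding loop.
    """
    seq = list(prompt_tokens) + [BOI]
    for _ in range(n_image_tokens):
        seq.append(rng.choice(IMAGE_VOCAB))
    seq.append(EOI)
    return seq

rng = random.Random(0)
out = generate([1, 2, 3], n_image_tokens=16, rng=rng)
```

Because the number of image tokens is a free parameter of the loop rather than a fixed grid, this style of decoding is what makes arbitrary output resolutions possible.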

Quick Start & Requirements

  • Installation: Requires Python 3.12 and uses Conda for environment management (conda create -n xomni python==3.12, conda activate xomni). Install dependencies via pip install -r requirements.txt and pip install flash-attn --no-build-isolation.
  • Prerequisites: CUDA 12 is recommended for flash-attn.
  • Inference: Examples provided for English and Chinese image generation, and multi-modal chat. Requires downloading FLUX.1-dev model weights.
  • LongText-Bench: Requires transformers==4.52.0 and qwen_vl_utils. Evaluation uses a distributed script.
  • Links: Project Page, Paper, Model, Space, LongText-Bench.
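Collecting the steps above, a minimal environment setup might look like the following. The conda and pip commands are taken from the README; combining the LongText-Bench pins into the same environment is an assumption based on its stated requirements.

```shell
# Create and activate the environment (Python 3.12, per the README)
conda create -n xomni python==3.12
conda activate xomni

# Core dependencies; flash-attn is recommended to build against CUDA 12
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# For LongText-Bench evaluation (pinned transformers version from the README)
pip install transformers==4.52.0 qwen_vl_utils
```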

Highlighted Details

  • Unified discrete autoregressive model for image and language.
  • Superior instruction following and text rendering (English/Chinese).
  • Generates images at arbitrary resolutions.
  • Includes the LongText-Bench benchmark for evaluation.

Maintenance & Community

The project is associated with Tencent Hunyuan X Team. Contact information for Yibing Wang and Xiaosong Zhang is provided for inquiries and collaboration.

Licensing & Compatibility

The repository does not explicitly state a license. The model weights are available on Hugging Face. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify any limitations or known issues. The project appears to be recent, with a paper published in 2025.

Health Check
Last commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
12
Star History
337 stars in the last 18 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 7 more.

open_flamingo by mlfoundations
0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago, updated 11 months ago