Cheers  by AI9Stars

Unified multimodal model for comprehension and generation

Created 1 month ago
272 stars

Top 94.7% on SourcePulse

View on GitHub
Project Summary

Cheers is a unified multimodal model addressing the challenge of jointly optimizing visual comprehension and generation. It targets researchers and practitioners by decoupling visual patch details from semantic representations, enabling stable understanding and high-fidelity generation. The model offers an efficient unified approach, matching state-of-the-art performance at significantly reduced training cost.

How It Works

Cheers separates fine-grained visual details from semantic information using a unified vision tokenizer, an LLM-based Transformer for text/image generation, and a cascaded flow matching head. This head decodes semantics and injects semantically gated detail residuals for refinement. This decoupling stabilizes semantics for comprehension and enhances generation fidelity, enabling efficient unified multimodal modeling with substantial token compression.
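The "semantically gated detail residual" idea can be sketched abstractly: the head first decodes a coarse output from the semantic tokens, then adds fine-grained detail scaled by a gate computed from those same semantics. A minimal NumPy sketch; the shapes, the sigmoid gate, and the linear projection are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(semantic_decoded, detail_residual, gate_weights):
    """Inject semantically gated detail residuals into a coarse decode.

    semantic_decoded : (d,) coarse reconstruction from semantic tokens
    detail_residual  : (d,) fine-grained patch detail
    gate_weights     : (d, d) hypothetical projection producing the gate
    """
    # The gate is a function of the semantics alone, so detail only
    # flows in where the semantic content supports it (assumption).
    gate = sigmoid(gate_weights @ semantic_decoded)  # (d,) values in (0, 1)
    return semantic_decoded + gate * detail_residual

rng = np.random.default_rng(0)
d = 8
coarse = rng.normal(size=d)
detail = rng.normal(size=d)
W = rng.normal(size=(d, d))
refined = refine(coarse, detail, W)
print(refined.shape)  # (8,)
```

Because the gate depends only on the semantic path, gradients through comprehension objectives are insulated from pixel-level detail, which is one plausible reading of why the decoupling stabilizes understanding.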

Quick Start & Requirements

Setup requires Python 3.11 and pip install -r requirements.txt. Inference needs PyTorch with CUDA and bfloat16. A demo is available on Hugging Face. Training uses the VeOmni framework (Training/ directory) with specific scripts and data formats. Evaluation is supported via VLMEvalKit and GenEval. Training on 3.8M samples reportedly takes ~2 days on 8x A100 GPUs.
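Concretely, setup would look something like the following. The clone URL and the exact entry points are assumptions, since the card does not list them; check the GitHub page for the real ones:

```shell
# Assumed clone URL; verify against the project's GitHub page.
git clone https://github.com/AI9Stars/Cheers.git
cd Cheers

# Python 3.11 environment with the pinned dependencies.
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Inference expects a CUDA-capable GPU with bfloat16 support.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.is_bf16_supported())"
```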

Highlighted Details

  • Matches or surpasses advanced UMMs in visual understanding and generation benchmarks.
  • Outperforms Tar-1.5B on GenEval/MMBench at 20% of its training cost.
  • Enables efficient unified multimodal modeling via 4x token compression.
  • Training framework supports image editing.
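To make the 4x token compression concrete, here is a back-of-the-envelope count; the 512x512 resolution and 16x16 patch size are hypothetical, chosen only to illustrate the arithmetic:

```python
def token_count(image_size, patch_size, compression=1):
    """Tokens the LLM sees for a square image, after optional compression."""
    patches_per_side = image_size // patch_size
    return (patches_per_side * patches_per_side) // compression

# Hypothetical 512x512 image tokenized into 16x16 patches.
raw = token_count(512, 16)            # 1024 patch tokens
compressed = token_count(512, 16, 4)  # 256 tokens after 4x compression
print(raw, compressed)  # 1024 256
```

Since Transformer attention cost grows quadratically with sequence length, a 4x reduction in visual tokens cuts attention FLOPs on the visual portion by roughly 16x, which is consistent with the reported training-cost savings.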

Maintenance & Community

Developed by researchers from Tsinghua, Xi'an Jiaotong, and UCAS. Contact emails are provided; no explicit community channels are listed. Ongoing development is indicated by plans for releasing training data and a v1.1 update.

Licensing & Compatibility

The README does not specify a software license, so license type and compatibility cannot be determined. Users considering commercial use or integration into closed-source projects should seek clarification from the authors.

Limitations & Caveats

Training data is not yet released, potentially hindering full reproduction. The project is new (March 2026 release), suggesting ongoing evolution. The absence of explicit licensing information is a significant adoption blocker.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 393 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA
Suite of neural tokenizers for image and video processing. 2k stars. Created 1 year ago; updated 1 year ago.