Cheers  by AI9Stars

Unified multimodal model for comprehension and generation

Created 1 month ago
272 stars

Top 94.7% on SourcePulse

View on GitHub
Project Summary

Cheers is a unified multimodal model addressing the challenge of jointly optimizing visual comprehension and generation. It targets researchers and practitioners by decoupling visual patch details from semantic representations, enabling stable understanding and high-fidelity generation. The model offers an efficient unified approach, matching state-of-the-art performance at significantly reduced training cost.

How It Works

Cheers separates fine-grained visual details from semantic information using a unified vision tokenizer, an LLM-based Transformer for text/image generation, and a cascaded flow matching head. This head decodes semantics and injects semantically gated detail residuals for refinement. This decoupling stabilizes semantics for comprehension and enhances generation fidelity, enabling efficient unified multimodal modeling with substantial token compression.
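The "semantically gated detail residual" idea can be sketched abstractly: the head first decodes a coarse output from the semantic tokens, then adds fine-grained detail scaled by a gate computed from those same semantics. A minimal NumPy sketch; the shapes, the sigmoid gate, and the linear projection are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(semantic_decoded, detail_residual, gate_weights):
    """Inject semantically gated detail residuals into a coarse decode.

    semantic_decoded : (d,) coarse reconstruction from semantic tokens
    detail_residual  : (d,) fine-grained patch detail
    gate_weights     : (d, d) hypothetical projection producing the gate
    """
    # The gate is a function of the semantics alone, so detail only
    # flows in where the semantic content supports it (assumption).
    gate = sigmoid(gate_weights @ semantic_decoded)  # (d,) values in (0, 1)
    return semantic_decoded + gate * detail_residual

rng = np.random.default_rng(0)
d = 8
coarse = rng.normal(size=d)
detail = rng.normal(size=d)
W = rng.normal(size=(d, d))
refined = refine(coarse, detail, W)
print(refined.shape)  # (8,)
```

Because the gate depends only on the semantic path, gradients through comprehension objectives are insulated from pixel-level detail, which is one plausible reading of why the decoupling stabilizes understanding.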

Quick Start & Requirements

Setup requires Python 3.11 and pip install -r requirements.txt. Inference needs PyTorch with CUDA and bfloat16. A demo is available on Hugging Face. Training uses the VeOmni framework (Training/ directory) with specific scripts and data formats. Evaluation is supported via VLMEvalKit and GenEval. Training on 3.8M samples reportedly takes ~2 days on 8x A100 GPUs.
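Concretely, setup would look something like the following. The clone URL and the exact entry points are assumptions, since the card does not list them; check the GitHub page for the real ones:

```shell
# Assumed clone URL; verify against the project's GitHub page.
git clone https://github.com/AI9Stars/Cheers.git
cd Cheers

# Python 3.11 environment with the pinned dependencies.
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Inference expects a CUDA-capable GPU with bfloat16 support.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.is_bf16_supported())"
```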

Highlighted Details

  • Matches or surpasses advanced UMMs in visual understanding and generation benchmarks.
  • Outperforms Tar-1.5B on GenEval/MMBench at 20% of its training cost.
  • Enables efficient unified multimodal modeling via 4x token compression.
  • Training framework supports image editing.
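To make the 4x token compression concrete, here is a back-of-the-envelope count; the 512x512 resolution and 16x16 patch size are hypothetical, chosen only to illustrate the arithmetic:

```python
def token_count(image_size, patch_size, compression=1):
    """Tokens the LLM sees for a square image, after optional compression."""
    patches_per_side = image_size // patch_size
    return (patches_per_side * patches_per_side) // compression

# Hypothetical 512x512 image tokenized into 16x16 patches.
raw = token_count(512, 16)            # 1024 patch tokens
compressed = token_count(512, 16, 4)  # 256 tokens after 4x compression
print(raw, compressed)  # 1024 256
```

Since Transformer attention cost grows quadratically with sequence length, a 4x reduction in visual tokens cuts attention FLOPs on the visual portion by roughly 16x, which is consistent with the reported training-cost savings.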

Maintenance & Community

Developed by researchers from Tsinghua, Xi'an Jiaotong, and UCAS. Contact emails are provided; no explicit community channels are listed. Ongoing development is indicated by plans for releasing training data and a v1.1 update.

Licensing & Compatibility

The README does not specify a software license, so license type and compatibility cannot be determined. Users considering commercial use or integration into closed-source projects should seek clarification from the authors.

Limitations & Caveats

Training data is not yet released, potentially hindering full reproduction. The project is new (March 2026 release), suggesting ongoing evolution. The absence of explicit licensing information is a significant adoption blocker.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 393 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA
Suite of neural tokenizers for image and video processing. 2k stars. Created 1 year ago; updated 1 year ago.