TokenFlow by ByteVisionLab

Image tokenizer for multimodal tasks (research paper)

Created 1 year ago

464 stars

Top 64.6% on SourcePulse

Project Summary

TokenFlow offers a unified image tokenizer designed to bridge multimodal understanding and generation tasks. It targets researchers and developers working with vision-language models, providing a novel approach to image representation that enhances performance in both understanding and generation.

How It Works

TokenFlow employs a dual-codebook architecture that separates semantic and pixel-level feature learning. A shared mapping mechanism links the two codebooks. This lets the tokenizer keep high-level semantics for understanding and fine-grained visual detail for generation, while aligning visual tokens more closely with text. The approach targets state-of-the-art results on multimodal understanding benchmarks alongside competitive text-to-image generation quality.
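The README does not spell out how the shared mapping selects a token. The sketch below is one plausible reading, offered under stated assumptions:

- The semantic and pixel codebooks have the same number of entries.
- A single shared index is chosen to minimize a weighted sum of the semantic and pixel distances.
- Each index therefore yields a paired (semantic, pixel) code.

The names `dual_codebook_quantize`, `w_sem`, and `w_pix`, the toy codebooks, and the pure-Python lists are hypothetical illustrations, not TokenFlow's actual code. The real model works on learned encoder features, and the sketch omits those encoders entirely.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_codebook_quantize(
    sem_feat: Vector,
    pix_feat: Vector,
    sem_codebook: List[Vector],
    pix_codebook: List[Vector],
    w_sem: float = 1.0,
    w_pix: float = 1.0,
) -> Tuple[int, Vector, Vector]:
    """Hypothetical shared-index lookup over two paired codebooks.

    Index i refers to the pair (sem_codebook[i], pix_codebook[i]); the chosen
    index minimizes w_sem * semantic distance + w_pix * pixel distance.
    Returns the index and both selected code vectors.
    """
    if len(sem_codebook) != len(pix_codebook):
        raise ValueError("codebooks must have the same size for a shared index")
    best_idx = min(
        range(len(sem_codebook)),
        key=lambda i: w_sem * sq_dist(sem_feat, sem_codebook[i])
        + w_pix * sq_dist(pix_feat, pix_codebook[i]),
    )
    return best_idx, sem_codebook[best_idx], pix_codebook[best_idx]


# Toy codebooks (assumed values): 2-D semantic codes paired with 1-D pixel codes.
sem_cb = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
pix_cb = [[0.0], [5.0], [1.0]]
query_sem, query_pix = [1.0, 1.0], [0.8]

# Joint matching picks index 2; semantics alone (w_pix=0) would pick index 1.
print(dual_codebook_quantize(query_sem, query_pix, sem_cb, pix_cb)[0])  # 2
print(dual_codebook_quantize(query_sem, query_pix, sem_cb, pix_cb, w_pix=0.0)[0])  # 1
```

The two calls show why a shared index matters under this assumption. The semantically closest code (index 1) has a poor pixel match, so the joint criterion settles on a code that is acceptable in both spaces.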

Quick Start & Requirements

Detailed instructions for training and evaluation of the tokenizer, multimodal understanding, and text-to-image models are available in GETTING_STARTED.md. Checkpoints for various model sizes and configurations are provided on Hugging Face.

Highlighted Details

Outperforms LLaVA-1.5 and Emu3 on multimodal understanding tasks.
Delivers text-to-image generation performance comparable to SDXL at 256x256 resolution.
Features a dual-codebook architecture for decoupled semantic and pixel-level feature learning.
Official implementation of the paper accepted to CVPR 2025.

Maintenance & Community

The project is actively maintained by ByteFlow-AI, with code and checkpoints released in December 2024. The project page and paper provide further details. The team also advertises open research positions.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README indicates that a single-scale version of TokenFlow is planned but not yet released. Specific hardware requirements, such as GPU models or CUDA versions, are not stated in the README.

Explore Similar Projects

dots.vlm1 by rednote-hilab

Cheers by AI9Stars

NextStep-1 by stepfun-ai

BitDance by shallowdream204

Lumina-mGPT by Alpha-VLLM

Liquid by FoundationVision

gill by kohjingyu

GLM-Image by zai-org

HunyuanVideo-I2V by Tencent-Hunyuan

ideogram4 by ideogram-oss

Bagel by ByteDance-Seed

Janus by deepseek-ai