TokenFlow by ByteFlow-AI

Image tokenizer for multimodal tasks (research paper)

created 8 months ago
365 stars

Top 78.2% on sourcepulse

Project Summary

TokenFlow offers a unified image tokenizer designed to bridge multimodal understanding and generation tasks. It targets researchers and developers working with vision-language models, providing a novel approach to image representation that enhances performance in both understanding and generation.

How It Works

TokenFlow employs a dual-codebook architecture that decouples semantic and pixel-level feature learning. A shared mapping mechanism keeps the two codebooks aligned, so the same tokenizer can serve understanding tasks, which rely on high-level semantics, and generation tasks, which need fine pixel detail. The approach aims to achieve state-of-the-art results on multimodal understanding benchmarks and competitive text-to-image generation quality.
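The shared-mapping idea can be illustrated with a minimal NumPy sketch. It is not TokenFlow's implementation: the function name, the feature shapes, and the distance weight w_pix are all assumptions made for illustration. The sketch picks a single index per image patch by minimizing a weighted sum of the distances to both codebooks, then reads the semantic and pixel-level embeddings out at that shared index.

```python
import numpy as np

def shared_index_quantize(sem_feats, pix_feats, sem_codebook, pix_codebook, w_pix=1.0):
    """Quantize each patch with ONE index shared by two codebooks.

    Hypothetical helper for illustration only.
    sem_feats: (N, Ds) semantic features, pix_feats: (N, Dp) pixel-level features.
    sem_codebook: (K, Ds), pix_codebook: (K, Dp); row k of each belongs together.
    """
    # Squared Euclidean distance from every patch to every codebook entry: (N, K).
    d_sem = ((sem_feats[:, None, :] - sem_codebook[None, :, :]) ** 2).sum(axis=-1)
    d_pix = ((pix_feats[:, None, :] - pix_codebook[None, :, :]) ** 2).sum(axis=-1)
    # Joint criterion: the chosen index must fit both feature levels at once.
    idx = np.argmin(d_sem + w_pix * d_pix, axis=1)
    return idx, sem_codebook[idx], pix_codebook[idx]

# Toy data: 8 codebook entries, 5 patches built from known entries plus small noise.
rng = np.random.default_rng(0)
K, N, Ds, Dp = 8, 5, 16, 4
sem_codebook = rng.normal(size=(K, Ds))
pix_codebook = rng.normal(size=(K, Dp))
true_idx = rng.integers(0, K, size=N)
sem_feats = sem_codebook[true_idx] + 0.01 * rng.normal(size=(N, Ds))
pix_feats = pix_codebook[true_idx] + 0.01 * rng.normal(size=(N, Dp))

idx, sem_q, pix_q = shared_index_quantize(sem_feats, pix_feats, sem_codebook, pix_codebook)
print(idx.shape, sem_q.shape, pix_q.shape)
```

Because both lookups use the same index, one discrete token carries a semantic embedding and a pixel-level embedding at the same time. In a design like this, the weight on the pixel distance controls how much the index choice favors reconstruction detail over semantics.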

Quick Start & Requirements

Detailed instructions for training and evaluation of the tokenizer, multimodal understanding, and text-to-image models are available in GETTING_STARTED.md. Checkpoints for various model sizes and configurations are provided on Hugging Face.

Highlighted Details

  • Achieves superior performance on multimodal understanding tasks compared to LLaVA-1.5 and EMU3.
  • Delivers comparable text-to-image generation performance to SDXL at 256x256 resolution.
  • Features a dual-codebook architecture for decoupled semantic and pixel-level feature learning.
  • Official implementation of the paper, which was accepted to CVPR 2025.

Maintenance & Community

The project is actively maintained by ByteFlow-AI, with code and checkpoints released in December 2024. The project page and paper are available for further details. Open positions for researchers are advertised.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README indicates that a single-scale version of TokenFlow is planned for release but is not yet available. It does not state hardware requirements such as GPU or CUDA versions.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 50 stars in the last 90 days
