BitDance by shallowdream204

Powerful multimodal autoregressive model for efficient visual generation

Created 1 week ago


341 stars

Top 81.4% on SourcePulse

Project Summary

BitDance is an open-source, 14B parameter autoregressive multimodal model designed for efficient visual generation. It addresses limitations of discrete autoregressive models, such as poor tokenizer reconstruction and slow generation, by introducing a large-vocabulary binary tokenizer, a binary diffusion head, and a novel next-patch diffusion paradigm. This approach enables rapid, high-resolution, photorealistic image synthesis, targeting researchers and power users seeking scalable generative capabilities.

How It Works

BitDance utilizes a decoder-only architecture incorporating a large-vocabulary binary tokenizer and a binary diffusion head. Its key innovation is the "next-patch diffusion paradigm," which allows for parallel prediction of up to 64 visual tokens per step. This contrasts with traditional token-by-token generation, offering a significant speedup (over 30x reported) and improved efficiency for generating high-resolution images. The unified multimodal framework is designed for scalability and simplicity.
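The step-count saving from next-patch diffusion can be illustrated with a toy calculation (this is not BitDance's actual code; the 4096-token layout is an assumption for illustration). Predicting up to 64 tokens per step cuts the number of forward passes by up to 64x, while the reported end-to-end speedup is over 30x because each diffusion step does more work than a plain token prediction.

```python
import math

def sequential_steps(num_tokens: int) -> int:
    """One forward pass per token, as in standard AR decoding."""
    return num_tokens

def next_patch_steps(num_tokens: int, patch_size: int = 64) -> int:
    """One forward pass per patch of tokens predicted in parallel."""
    return math.ceil(num_tokens / patch_size)

# Hypothetical example: an image represented by 4096 visual tokens.
tokens = 4096
print(sequential_steps(tokens))   # 4096 decoding steps
print(next_patch_steps(tokens))   # 64 decoding steps
```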

Quick Start & Requirements

  • Installation: Clone the repository (https://github.com/shallowdream204/BitDance.git), create a Python 3.11 Conda environment, activate it, and install dependencies via pip install -r requirements.txt and pip install flash_attn==2.8.2 --no-build-isolation.
  • Prerequisites: Python 3.11, flash-attn (v2.8.2), CUDA (implied for GPU usage).
  • Model Download: Use hf download commands for T2I and ImageNet models.
  • Resources: Links to the official website, demo (Huggingface Space: BitDance-Demo), and paper are provided.
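The install steps above can be sketched as a shell session (environment-setup fragment, not verified against the repo; the exact model identifiers for `hf download` are listed in the BitDance README and are left as placeholders here):

```shell
git clone https://github.com/shallowdream204/BitDance.git
cd BitDance

# Python 3.11 Conda environment, as the README specifies
conda create -n bitdance python=3.11 -y
conda activate bitdance

pip install -r requirements.txt
pip install flash_attn==2.8.2 --no-build-isolation

# Download the T2I / ImageNet checkpoints; substitute the repo ids
# given in the README for the placeholder below.
hf download <model-id>
```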

Highlighted Details

  • 14B parameter multimodal autoregressive foundation model.
  • Achieves over 30x speedup in generation compared to standard AR models via parallel multi-token prediction (up to 64 tokens/step).
  • Surpasses open-source AR models on text-to-image benchmarks.
  • Provides both PyTorch native and Hugging Face diffusers versions.
  • Includes pre-trained binary visual tokenizers with vocabulary sizes $2^{32}$, $2^{128}$, and $2^{256}$.
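A vocabulary of $2^{n}$ in a binary tokenizer means each visual token is an $n$-bit code rather than an index into a materialized embedding table, which is how vocabularies as large as $2^{256}$ stay tractable. A minimal sketch of the bit-packing view (an assumption about the representation, not the released tokenizer's code):

```python
def bits_to_token_id(bits: list[int]) -> int:
    """Pack a binary code (e.g. from a quantized encoder) into a token id."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b
    return token_id

def token_id_to_bits(token_id: int, n_bits: int) -> list[int]:
    """Unpack a token id back into its binary code."""
    return [(token_id >> i) & 1 for i in reversed(range(n_bits))]

code = [1, 0, 1, 1]                       # 4-bit example code
tid = bits_to_token_id(code)              # -> 11
assert token_id_to_bits(tid, 4) == code

# The released tokenizers use 32-, 128-, and 256-bit codes:
for n in (32, 128, 256):
    print(f"{n}-bit codes -> vocabulary size 2**{n}")
```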

Maintenance & Community

Recent updates (February 2026) include the release of a diffusers version and UniWeTok, a unified binary tokenizer. A project website and interactive demo are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

As a research project, training code is still being organized and will be released later. Specific limitations or known bugs are not detailed in the README.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 347 stars in the last 13 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

  • Cosmos-Tokenizer by NVIDIA: suite of neural tokenizers for image and video processing. 2k stars; created 1 year ago, updated 1 year ago.