BitDance by shallowdream204

Powerful multimodal autoregressive model for efficient visual generation

Created 1 week ago


341 stars

Top 81.4% on SourcePulse

Project Summary

BitDance is an open-source, 14B parameter autoregressive multimodal model designed for efficient visual generation. It addresses limitations of discrete autoregressive models, such as poor tokenizer reconstruction and slow generation, by introducing a large-vocabulary binary tokenizer, a binary diffusion head, and a novel next-patch diffusion paradigm. This approach enables rapid, high-resolution, photorealistic image synthesis, targeting researchers and power users seeking scalable generative capabilities.

How It Works

BitDance utilizes a decoder-only architecture incorporating a large-vocabulary binary tokenizer and a binary diffusion head. Its key innovation is the "next-patch diffusion paradigm," which allows for parallel prediction of up to 64 visual tokens per step. This contrasts with traditional token-by-token generation, offering a significant speedup (over 30x reported) and improved efficiency for generating high-resolution images. The unified multimodal framework is designed for scalability and simplicity.
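The step-count saving from next-patch diffusion can be illustrated with a toy calculation (this is not BitDance's actual code; the 4096-token layout is an assumption for illustration). Predicting up to 64 tokens per step cuts the number of forward passes by up to 64x, while the reported end-to-end speedup is over 30x because each diffusion step does more work than a plain token prediction.

```python
import math

def sequential_steps(num_tokens: int) -> int:
    """One forward pass per token, as in standard AR decoding."""
    return num_tokens

def next_patch_steps(num_tokens: int, patch_size: int = 64) -> int:
    """One forward pass per patch of tokens predicted in parallel."""
    return math.ceil(num_tokens / patch_size)

# Hypothetical example: an image represented by 4096 visual tokens.
tokens = 4096
print(sequential_steps(tokens))   # 4096 decoding steps
print(next_patch_steps(tokens))   # 64 decoding steps
```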

Quick Start & Requirements

  • Installation: Clone the repository (https://github.com/shallowdream204/BitDance.git), create a Python 3.11 Conda environment, activate it, and install dependencies via pip install -r requirements.txt and pip install flash_attn==2.8.2 --no-build-isolation.
  • Prerequisites: Python 3.11, flash-attn (v2.8.2), CUDA (implied for GPU usage).
  • Model Download: Use hf download commands for T2I and ImageNet models.
  • Resources: Links to the official website, demo (Huggingface Space: BitDance-Demo), and paper are provided.
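The install steps above can be sketched as a shell session (environment-setup fragment, not verified against the repo; the exact model identifiers for `hf download` are listed in the BitDance README and are left as placeholders here):

```shell
git clone https://github.com/shallowdream204/BitDance.git
cd BitDance

# Python 3.11 Conda environment, as the README specifies
conda create -n bitdance python=3.11 -y
conda activate bitdance

pip install -r requirements.txt
pip install flash_attn==2.8.2 --no-build-isolation

# Download the T2I / ImageNet checkpoints; substitute the repo ids
# given in the README for the placeholder below.
hf download <model-id>
```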

Highlighted Details

  • 14B parameter multimodal autoregressive foundation model.
  • Achieves over 30x speedup in generation compared to standard AR models via parallel multi-token prediction (up to 64 tokens/step).
  • Surpasses open-source AR models on text-to-image benchmarks.
  • Provides both PyTorch native and Hugging Face diffusers versions.
  • Includes pre-trained binary visual tokenizers with vocabulary sizes $2^{32}$, $2^{128}$, and $2^{256}$.
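A vocabulary of $2^{n}$ in a binary tokenizer means each visual token is an $n$-bit code rather than an index into a materialized embedding table, which is how vocabularies as large as $2^{256}$ stay tractable. A minimal sketch of the bit-packing view (an assumption about the representation, not the released tokenizer's code):

```python
def bits_to_token_id(bits: list[int]) -> int:
    """Pack a binary code (e.g. from a quantized encoder) into a token id."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b
    return token_id

def token_id_to_bits(token_id: int, n_bits: int) -> list[int]:
    """Unpack a token id back into its binary code."""
    return [(token_id >> i) & 1 for i in reversed(range(n_bits))]

code = [1, 0, 1, 1]                       # 4-bit example code
tid = bits_to_token_id(code)              # -> 11
assert token_id_to_bits(tid, 4) == code

# The released tokenizers use 32-, 128-, and 256-bit codes:
for n in (32, 128, 256):
    print(f"{n}-bit codes -> vocabulary size 2**{n}")
```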

Maintenance & Community

Recent updates (February 2026) include the release of a diffusers version and UniWeTok, a unified binary tokenizer. A project website and interactive demo are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

As a research project, training code is still being organized and will be released later. Specific limitations or known bugs are not detailed in the README.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 347 stars in the last 13 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

  • Cosmos-Tokenizer by NVIDIA: suite of neural tokenizers for image and video processing. 2k stars; created 1 year ago, updated 1 year ago.