transfusion-pytorch  by lucidrains

PyTorch implementation for multimodal model research

Created 1 year ago
1,204 stars

Top 32.5% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of Transfusion, a multi-modal model that both predicts the next token and diffuses images within one network. It targets researchers and practitioners working on multi-modal AI. Its main benefit is flexibility: text, images, and audio can all be handled by a single transformer-based framework.

How It Works

The core of Transfusion is a single transformer that processes all modalities. Instead of traditional diffusion, this implementation uses flow matching, a choice inspired by the success of Flux. Flow matching trains the model to predict a velocity field that carries noise samples to data along a continuous path. Multiple modalities are supported by specifying a latent dimension and a default shape for each one.
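To make the flow matching idea concrete, here is a minimal, framework-free sketch in plain Python. It is not the repository's code: the function names (`flow_matching_pair`, `sample`) are hypothetical, and it assumes the common straight-path formulation in which the input is a linear interpolation between noise and data and the regression target is their difference. A real implementation would operate on PyTorch tensors and a learned network.

```python
import random

def lerp(a, b, t):
    """Linearly interpolate between two equal-length vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def flow_matching_pair(data, rng):
    """Build one training example: (noised input, time, target velocity).

    Assumption: straight-line path from Gaussian noise (t=0) to data (t=1).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()                                # time in [0, 1)
    x_t = lerp(noise, data, t)                      # point on the path at time t
    target = [d - n for d, n in zip(data, noise)]   # constant velocity along the path
    return x_t, t, target

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def sample(velocity_fn, noise, steps=10):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (data)."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a model would receive `x_t` and `t` and be penalized with `mse` against `target`. At generation time, `sample` starts from noise and follows the predicted velocities in a fixed number of steps.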

Quick Start & Requirements

  • Install: pip install transfusion-pytorch
  • Dependencies: PyTorch. To run the examples, install the extras from a local clone with pip install .[examples], which adds diffusers, transformers, accelerate, scipy, ftfy, and safetensors.
  • Usage examples are provided for single and multiple modalities, including image encoding/decoding.

Highlighted Details

  • Implements Transfusion, a multi-modal model using flow matching instead of diffusion.
  • Supports flexible integration of multiple modalities (text, images, audio) with configurable latent dimensions.
  • Includes optional modality encoders/decoders for direct image processing.
  • Can be pre-trained on text-only data.
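As a rough illustration of how one model can combine next-token prediction with a flow objective, the sketch below adds a text cross-entropy term to a velocity-regression term. Everything here is an assumption made for illustration: the function names, the `flow_weight` parameter, and the simple averaging are hypothetical, and the repository may reduce or weight its losses differently.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def combined_loss(text_terms, flow_terms, flow_weight=1.0):
    """Sum a language-modeling loss and a weighted flow loss.

    text_terms: list of (logits, next_token_id) pairs for text positions.
    flow_terms: list of (predicted_velocity, target_velocity) pairs for
                positions that hold modality latents (e.g. image patches).
    flow_weight: hypothetical balancing coefficient, not taken from the repo.
    """
    text_loss = sum(cross_entropy(l, y) for l, y in text_terms) / max(len(text_terms), 1)
    flow_loss = sum(mse(p, v) for p, v in flow_terms) / max(len(flow_terms), 1)
    return text_loss + flow_weight * flow_loss
```

With an empty `flow_terms` list the expression reduces to the ordinary language-modeling loss, which is one way to picture pre-training on text-only data.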

Maintenance & Community

The project implements the original Transfusion paper from Meta AI. The README lists no community channels and gives no information about active maintainers.

Licensing & Compatibility

The repository does not explicitly state a license. The included citations are for research papers, not project licensing.

Limitations & Caveats

Because the README does not specify a license, commercial use or integration into closed-source projects may be restricted. The project appears to be a research implementation; stability and production-readiness are not guaranteed.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 30 days
