CM3Leon by kyegomez

Open-source implementation of a multimodal AI research paper

Created 2 years ago

364 stars

Top 77.3% on SourcePulse

Project Summary

CM3Leon is an open-source implementation of a multimodal AI model capable of generating both text and images autoregressively. It targets researchers and developers working on advanced generative models, offering a unified decoder for diverse content creation. The project aims to provide a state-of-the-art, computationally efficient alternative for multimodal generation tasks.

How It Works

CM3Leon employs a decoder-only transformer architecture, similar to GPT models, but extended for multimodal inputs. It utilizes a two-stage training process: retrieval-augmented pretraining on a large, diverse dataset and supervised fine-tuning on specific text-image tasks. A key innovation is contrastive decoding, which enhances the quality and coherence of generated samples by balancing conditional and unconditional generation streams.
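The balancing of conditional and unconditional streams can be illustrated with a small, dependency-free sketch. It assumes a classifier-free-guidance-style blend of the two sets of next-token scores; the exact contrastive decoding rule used by the paper or this repository may differ. The helper names (`blend_streams`, `greedy_pick`), the weight `alpha`, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def blend_streams(cond_logits, uncond_logits, alpha=3.0):
    """Hypothetical helper: weight the conditional stream against the unconditional one.

    alpha = 1.0 returns the conditional distribution unchanged; larger values
    amplify whatever the conditioning (for example a text prompt) adds.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + alpha * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalise so the scores are log-probabilities


def greedy_pick(log_probs):
    """Return the index of the highest-scoring token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


# Toy vocabulary of three tokens (made-up numbers, not model output).
cond_logits = [2.0, 1.9, 0.1]    # scores given the prompt
uncond_logits = [2.5, 0.5, 0.1]  # scores with the prompt dropped

plain_token = greedy_pick(blend_streams(cond_logits, uncond_logits, alpha=1.0))
guided_token = greedy_pick(blend_streams(cond_logits, uncond_logits, alpha=3.0))
```

In this toy case token 0 is likely with or without the prompt, while token 1 is likely only with it, so raising `alpha` shifts the choice from token 0 to token 1. That is the sense in which the blend favours samples that actually reflect the conditioning.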

Quick Start & Requirements

Install: pip3 install cm3
Requirements: PyTorch environment, significant GPU/TPU resources for training, large multimodal datasets (e.g., Shutterstock), custom tokenizer implementation, retrieval infrastructure, and fine-tuning frameworks.
Links: PAPER LINK

Highlighted Details

Targets state-of-the-art text-to-image generation; the underlying paper reports outperforming comparable models while using about one-fifth of the training compute.
Employs retrieval augmented pretraining and contrastive decoding for improved sample quality.
Supports model sizes ranging from 350M to 7B parameters.
Uses custom tokenizers: a text tokenizer based on CommonCrawl data and an image tokenizer that encodes each 256x256 image into 1024 tokens.
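The image tokenization figures imply some simple arithmetic, sketched below. The 256x256 resolution and the 1024-token count come from the summary; the assumptions that the tokens form a square grid and that each token covers an equal square patch are mine, and `sequence_length` is a hypothetical helper.

```python
import math

IMAGE_SIDE = 256      # pixels per side, from the summary
IMAGE_TOKENS = 1024   # tokens per image, from the summary

# Assumption: the tokens are laid out as a square grid over the image.
grid_side = math.isqrt(IMAGE_TOKENS)               # 32 tokens per side
assert grid_side * grid_side == IMAGE_TOKENS

# Assumption: every token covers an equally sized square patch.
pixels_per_token_side = IMAGE_SIDE // grid_side    # 8 pixels per side


def sequence_length(text_tokens, images=1):
    """Hypothetical helper: decoder sequence length for text plus images."""
    return text_tokens + images * IMAGE_TOKENS


caption_plus_image = sequence_length(text_tokens=64)  # 64 + 1024 = 1088
```

Under these assumptions each image token summarizes an 8x8 pixel patch, and a single image already accounts for most of the decoder's context in a short caption-plus-image sequence.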

Maintenance & Community

The project is marked as "wip" (work in progress) and contributions are welcomed via pull requests and issues.
Support is available through the GitHub issue tracker.

Licensing & Compatibility

Licensed under the MIT license.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository is explicitly marked as unfinished ("wip"). Replicating the model requires substantial expertise in distributed training, data pipelines, and optimization techniques, along with significant computational infrastructure.
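To give a rough sense of scale for the 350M and 7B parameter sizes listed above, the following back-of-the-envelope sketch estimates the memory needed just to hold the weights. It assumes dense storage at 4 bytes (fp32) or 2 bytes (fp16) per parameter; `weight_memory_gib` is a hypothetical helper, and training needs considerably more memory for gradients, optimizer state, and activations.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params, dtype="fp16"):
    """Hypothetical helper: GiB needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


small_fp16 = weight_memory_gib(350_000_000, "fp16")    # about 0.65 GiB
large_fp16 = weight_memory_gib(7_000_000_000, "fp16")  # about 13 GiB
large_fp32 = weight_memory_gib(7_000_000_000, "fp32")  # about 26 GiB
```

Even before any training overhead, the largest configuration is near or beyond the memory of many single accelerators, which is consistent with the distributed-training caveat above.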
