AnyGPT by OpenMOSS

Multimodal LLM research paper for any-to-any modality conversion

created 1 year ago
854 stars

Top 42.8% on sourcepulse

View on GitHub
Project Summary

AnyGPT is a unified multimodal large language model designed for any-to-any modality conversion across speech, text, images, and music. It targets researchers and developers working with multimodal AI, offering a single model that handles both intermodal conversion and multimodal conversation.

How It Works

AnyGPT employs a discrete sequence modeling approach, converting all input modalities into a unified discrete representation. This allows a single Large Language Model (LLM) to process diverse data types through a next-token prediction task. The core advantage lies in its "compression is intelligence" philosophy: high-quality tokenizers and a low-perplexity LLM enable the model to compress vast multimodal internet data, potentially unlocking emergent capabilities beyond text-only LLMs.
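
To make this concrete, here is a minimal, self-contained sketch of the token-stream layout (vocabulary sizes, ID offsets, and tag tokens are illustrative assumptions, not AnyGPT's actual values):

    # Toy illustration of discrete sequence modeling: every modality is mapped
    # to IDs in one shared vocabulary, with a disjoint range per modality and
    # tag tokens marking segment boundaries, so a single LLM can be trained
    # with ordinary next-token prediction over the mixed stream.
    TEXT_VOCAB = 32_000     # e.g., a LLaMA-style BPE vocabulary (assumed size)
    SPEECH_CODES = 1_024    # e.g., a speech codec codebook (assumed size)
    IMAGE_CODES = 8_192     # e.g., an image codebook (assumed size)

    SPEECH_OFFSET = TEXT_VOCAB                 # speech IDs: 32000..33023
    IMAGE_OFFSET = TEXT_VOCAB + SPEECH_CODES   # image IDs: 33024..41215
    SOS = IMAGE_OFFSET + IMAGE_CODES           # hypothetical <speech> open tag
    EOS = SOS + 1                              # hypothetical </speech> close tag

    def flatten(text_ids, speech_codes):
        """Interleave text IDs and speech codes into one training sequence."""
        return (list(text_ids)
                + [SOS]
                + [SPEECH_OFFSET + c for c in speech_codes]  # shift into range
                + [EOS])

    print(flatten([5, 17, 42], [3, 900, 12]))
    # [5, 17, 42, 41216, 32003, 32900, 32012, 41217]

At generation time, IDs that fall in a modality's range would be routed to that modality's de-tokenizer (e.g., a vocoder for speech or a diffusion decoder for images) to reconstruct the output.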

Quick Start & Requirements

  • Install:
      git clone https://github.com/OpenMOSS/AnyGPT.git
      cd AnyGPT
      conda create --name AnyGPT python=3.9
      conda activate AnyGPT
      pip install -r requirements.txt
  • Dependencies: Python 3.9, Conda. Requires downloading model weights for AnyGPT-base, AnyGPT-chat, SpeechTokenizer, Soundstorm, SEED-tokenizer, unCLIP SD-UNet, and Encodec-32k.
  • Resources: Specific hardware requirements are not detailed, but LLM processing typically demands significant GPU resources.
  • Links: Project Page (for demos), Model Weights

Highlighted Details

  • Unified processing of speech, text, images, and music.
  • Supports intermodal conversions (e.g., text-to-image, speech-to-text).
  • Trained on the custom AnyInstruct dataset for multimodal conversations.
  • CLI inference available for both base and chat models.

Maintenance & Community

  • The project is associated with the paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling".
  • The README credits SpeechGPT and Vicuna as upstream work.

Licensing & Compatibility

  • Released under the original license of LLaMA2, which may restrict commercial use and redistribution; consult the LLaMA2 license terms for details.

Limitations & Caveats

The README notes that generation may still be unstable due to limitations in data and training resources; users may need to retry generation several times or experiment with different decoding strategies (see the sketch below).
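
As a minimal sketch of that advice, assuming the checkpoints load as ordinary Hugging Face causal language models (the repo ships its own CLI, which may work differently), one could sweep decoding settings like this:

    # Retry generation under several decoding strategies; unstable outputs
    # often improve with a different temperature/top-p combination.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/AnyGPT-chat"  # hypothetical local checkpoint path
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 torch_dtype=torch.float16)

    prompt = "Describe the uploaded melody, then render it as an image."
    inputs = tok(prompt, return_tensors="pt")

    for temperature, top_p in [(0.7, 0.9), (1.0, 0.95), (1.2, 0.8)]:
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_p=top_p, max_new_tokens=256)
        print(tok.decode(out[0], skip_special_tokens=True))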

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).

X-LLM by phellonchen

Multimodal LLM research paper

Top 0.3% on sourcepulse, 312 stars; created 2 years ago, updated 2 years ago