AnyGPT by OpenMOSS

Multimodal LLM research paper for any-to-any modality conversion

created 1 year ago
854 stars

Top 42.8% on sourcepulse

View on GitHub
Project Summary

AnyGPT is a unified multimodal large language model designed for any-to-any modality conversion across speech, text, images, and music. It targets researchers and developers working with multimodal AI, offering a single model that handles both intermodal conversion and multimodal conversation.

How It Works

AnyGPT employs a discrete sequence modeling approach, converting all input modalities into a unified discrete representation. This allows a single Large Language Model (LLM) to process diverse data types through a next-token prediction task. The core advantage lies in its "compression is intelligence" philosophy: high-quality tokenizers and a low-perplexity LLM enable the model to compress vast multimodal internet data, potentially unlocking emergent capabilities beyond text-only LLMs.
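
To make this concrete, here is a minimal, self-contained sketch of the token-stream layout (vocabulary sizes, ID offsets, and tag tokens are illustrative assumptions, not AnyGPT's actual values):

    # Toy illustration of discrete sequence modeling: every modality is mapped
    # to IDs in one shared vocabulary, with a disjoint range per modality and
    # tag tokens marking segment boundaries, so a single LLM can be trained
    # with ordinary next-token prediction over the mixed stream.
    TEXT_VOCAB = 32_000     # e.g., a LLaMA-style BPE vocabulary (assumed size)
    SPEECH_CODES = 1_024    # e.g., a speech codec codebook (assumed size)
    IMAGE_CODES = 8_192     # e.g., an image codebook (assumed size)

    SPEECH_OFFSET = TEXT_VOCAB                 # speech IDs: 32000..33023
    IMAGE_OFFSET = TEXT_VOCAB + SPEECH_CODES   # image IDs: 33024..41215
    SOS = IMAGE_OFFSET + IMAGE_CODES           # hypothetical <speech> open tag
    EOS = SOS + 1                              # hypothetical </speech> close tag

    def flatten(text_ids, speech_codes):
        """Interleave text IDs and speech codes into one training sequence."""
        return (list(text_ids)
                + [SOS]
                + [SPEECH_OFFSET + c for c in speech_codes]  # shift into range
                + [EOS])

    print(flatten([5, 17, 42], [3, 900, 12]))
    # [5, 17, 42, 41216, 32003, 32900, 32012, 41217]

At generation time, IDs that fall in a modality's range would be routed to that modality's de-tokenizer (e.g., a vocoder for speech or a diffusion decoder for images) to reconstruct the output.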

Quick Start & Requirements

  • Install:
      git clone https://github.com/OpenMOSS/AnyGPT.git
      cd AnyGPT
      conda create --name AnyGPT python=3.9
      conda activate AnyGPT
      pip install -r requirements.txt
  • Dependencies: Python 3.9, Conda. Requires downloading model weights for AnyGPT-base, AnyGPT-chat, SpeechTokenizer, Soundstorm, SEED-tokenizer, unCLIP SD-UNet, and Encodec-32k.
  • Resources: Specific hardware requirements are not detailed, but LLM processing typically demands significant GPU resources.
  • Links: Project Page (for demos), Model Weights

Highlighted Details

  • Unified processing of speech, text, images, and music.
  • Supports intermodal conversions (e.g., text-to-image, speech-to-text).
  • Trained on the custom AnyInstruct dataset for multimodal conversations.
  • CLI inference available for both base and chat models.

Maintenance & Community

  • The project is associated with the paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling".
  • The README credits SpeechGPT and Vicuna as upstream work.

Licensing & Compatibility

  • Released under the original license of LLaMA2, which may restrict commercial use and redistribution; consult the LLaMA2 license terms for details.

Limitations & Caveats

The README notes that generation may still be unstable due to limitations in data and training resources; users may need to retry generation several times or experiment with different decoding strategies (see the sketch below).
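
As a minimal sketch of that advice, assuming the checkpoints load as ordinary Hugging Face causal language models (the repo ships its own CLI, which may work differently), one could sweep decoding settings like this:

    # Retry generation under several decoding strategies; unstable outputs
    # often improve with a different temperature/top-p combination.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/AnyGPT-chat"  # hypothetical local checkpoint path
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 torch_dtype=torch.float16)

    prompt = "Describe the uploaded melody, then render it as an image."
    inputs = tok(prompt, return_tensors="pt")

    for temperature, top_p in [(0.7, 0.9), (1.0, 0.95), (1.2, 0.8)]:
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_p=top_p, max_new_tokens=256)
        print(tok.decode(out[0], skip_special_tokens=True))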

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).

X-LLM by phellonchen

Multimodal LLM research paper

Top 0.3% on sourcepulse, 312 stars; created 2 years ago, updated 2 years ago