NExT-GPT: Any-to-Any Multimodal LLM
NExT-GPT is an end-to-end multimodal large language model capable of processing and generating arbitrary combinations of text, image, video, and audio. It is designed for researchers and developers working with multimodal AI, offering a unified framework for "any-to-any" content interaction.
How It Works
NExT-GPT integrates pre-trained components: a multimodal encoder (ImageBind), a large language model (Vicuna), and diffusion decoders for image, audio, and video generation. Inputs are first encoded into language-like representations and passed to the LLM, which generates text along with special "modality signal" tokens. These signal tokens drive output projection layers that condition the multimodal decoders to produce the requested content, enabling flexible, cross-modal generation driven by a central LLM.
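To make the data flow concrete, the following minimal sketch mimics this pipeline with placeholder modules. The class names, dimensions, and the random tensors standing in for ImageBind, Vicuna, and the diffusion decoders are illustrative assumptions, not NExT-GPT's actual API.

```python
# Hypothetical sketch of a NExT-GPT-style any-to-any flow. All names and
# dimensions are illustrative stand-ins, not the project's real classes.
import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024       # assumed size of the frozen multimodal encoder output
LLM_DIM = 4096             # assumed hidden size of the Vicuna-style LLM
DIFFUSION_COND_DIM = 768   # assumed conditioning size of a diffusion decoder

class InputProjection(nn.Module):
    """Maps frozen encoder (e.g. ImageBind) features into the LLM token space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMAGEBIND_DIM, LLM_DIM)

    def forward(self, modality_features):
        return self.proj(modality_features)

class OutputProjection(nn.Module):
    """Maps the LLM's 'modality signal' token states into decoder conditioning."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(LLM_DIM, DIFFUSION_COND_DIM)

    def forward(self, signal_token_states):
        return self.proj(signal_token_states)

# Toy forward pass with random tensors standing in for the real components.
image_features = torch.randn(1, 3, IMAGEBIND_DIM)    # features from the encoder
soft_tokens = InputProjection()(image_features)       # language-like inputs for the LLM
llm_hidden = torch.randn(1, 8, LLM_DIM)               # placeholder for LLM output states
signal_states = llm_hidden[:, -4:, :]                 # states at the signal-token positions
conditioning = OutputProjection()(signal_states)      # conditions the image/audio/video decoder
print(conditioning.shape)                             # torch.Size([1, 4, 768])
```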
Quick Start & Requirements
Create a conda environment (conda create -n nextgpt python=3.8), activate it (conda activate nextgpt), install PyTorch with CUDA 11.6 support (conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia), and then install the remaining dependencies (pip install -r requirements.txt).
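Before downloading checkpoints, a quick sanity check (not part of the project's README) can confirm that PyTorch was installed with CUDA support; it uses only standard PyTorch calls.

```python
# Environment sanity check: verify the PyTorch install and CUDA visibility.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Report the first visible GPU; adjust the index on multi-GPU machines.
    print("GPU:", torch.cuda.get_device_name(0))
```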
Highlighted Details
Maintenance & Community
The project is associated with the National University of Singapore and is based on the ICML 2024 paper. Contact information for Shengqiong Wu and Hao Fei is provided.
Licensing & Compatibility
BSD 3-Clause License. NExT-GPT is intended for non-commercial research use only; commercial use requires explicit approval from the authors. Use for illegal, harmful, violent, racist, or sexual purposes is prohibited.
Limitations & Caveats
The project is intended primarily for research and non-commercial use, and commercial applications require author approval. The README notes ongoing work to support more LLM types and sizes as well as additional modalities.