UMOE-Scaling-Unified-Multimodal-LLMs by HITsz-TMG

Research paper on scaling unified multimodal LLMs with MoE

Created 1 year ago · 761 stars · Top 45.7% on SourcePulse

View on GitHub
Project Summary

Uni-MoE is a Mixture-of-Experts (MoE) based unified multimodal large language model (MLLM) capable of processing audio, speech, image, text, and video. It targets researchers and developers working on multimodal AI, offering a scalable architecture for handling diverse data types within a single model.

How It Works

The model employs a three-stage training process. First, it builds modality connectors to map diverse inputs into a unified language space. Second, modality-specific experts are trained using cross-modal data for deep understanding. Finally, these experts are integrated into an LLM backbone and refined using LoRA, enabling parallel processing at both expert and modality levels for enhanced scalability and efficiency.
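The PyTorch-style sketch below illustrates the two building blocks described above: a modality connector (stage one) that projects encoder features into the LLM's token space, and a sparse MoE layer (stages two and three) that routes each token to a small number of experts. Class names, dimensions, and the top-k routing details are illustrative assumptions rather than the repository's actual implementation, and the LoRA refinement step is omitted.

```python
import torch
import torch.nn as nn


class ModalityConnector(nn.Module):
    """Stage 1 (sketch): project a modality encoder's features into the LLM token space."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq, encoder_dim) -> (batch, seq, llm_dim)
        return self.proj(features)


class SparseMoELayer(nn.Module):
    """Stages 2-3 (sketch): route each token to its top-k modality-specific experts."""

    def __init__(self, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(llm_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(llm_dim, 4 * llm_dim),
                nn.GELU(),
                nn.Linear(4 * llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, llm_dim)
        scores = self.router(hidden).softmax(dim=-1)      # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(hidden)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(hidden[mask])
        return out
```

In the full model, connector outputs for audio, speech, image, and video are fed alongside text embeddings into the MoE-augmented LLM layers, and the experts are refined with LoRA; Uni-MoE-v2 scales the expert count to 8.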

Quick Start & Requirements

  • Installation: Clone the repository, activate a conda environment with Python 3.9.16, and install dependencies via pip install -r env.txt (see the consolidated snippet after this list).
  • Prerequisites: CUDA version >= 11.7.
  • Weights: Download and organize specified checkpoints for vision, speech, audio, and Uni-MoE models.
  • Configuration: Replace all absolute path placeholders with actual file paths.
  • Resources: 80 GB of GPU memory is recommended for experiments.
  • Links: Project Page, Demo Video, Paper, Hugging Face.
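The following shell snippet consolidates the setup steps above. The repository URL and the environment name `unimoe` are assumptions for illustration; checkpoint download links and the exact config paths to edit are listed in the repository's README.

```bash
# Requires an NVIDIA GPU with CUDA >= 11.7; 80 GB of GPU memory is recommended.
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs

# Python 3.9.16 environment (the name "unimoe" is arbitrary)
conda create -n unimoe python=3.9.16 -y
conda activate unimoe
pip install -r env.txt

# Next: download the vision, speech, audio, and Uni-MoE checkpoints,
# then replace the absolute-path placeholders in the configs with local paths.
```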

Highlighted Details

  • Supports diverse modalities: audio, speech, image, text, and video.
  • Scalable architecture using Mixture-of-Experts.
  • Three-stage training pipeline for robust multimodal understanding.
  • Released Uni-MoE-v2 with 8 experts and enhanced multi-node/GPU training scripts.
  • Introduced VideoVista benchmark and VideoVista-Train dataset.

Maintenance & Community

  • Paper accepted by IEEE TPAMI (2025).
  • Active development with recent releases of v2 checkpoints and training scripts.
  • Links to project page and demo available.

Licensing & Compatibility

  • Data and checkpoints are intended and licensed for research use only.
  • Restrictions follow LLaMA and Vicuna license agreements; commercial use is prohibited.

Limitations & Caveats

The project's data and checkpoints are strictly limited to research purposes and cannot be used commercially due to licensing restrictions inherited from LLaMA and Vicuna.

Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

  • Top 0.1% on SourcePulse · 4k stars
  • Any-to-any multimodal LLM research paper
  • Created 2 years ago · Updated 4 months ago
  • Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Zack Li (cofounder of Nexa AI), and 19 more.

LLaVA by haotian-liu

  • Top 0.2% on SourcePulse · 24k stars
  • Multimodal assistant with GPT-4 level capabilities
  • Created 2 years ago · Updated 1 year ago