UMOE-Scaling-Unified-Multimodal-LLMs by HITsz-TMG

Research paper on scaling unified multimodal LLMs with MoE

Created 1 year ago · 745 stars · Top 47.6% on sourcepulse

Project Summary

Uni-MoE is a Mixture-of-Experts (MoE) based unified multimodal large language model (MLLM) capable of processing audio, speech, image, text, and video. It targets researchers and developers working on multimodal AI, offering a scalable architecture for handling diverse data types within a single model.

How It Works

The model employs a three-stage training process. First, it builds modality connectors that map diverse inputs into a unified language representation space. Second, modality-specific experts are trained on cross-modal data to build deep per-modality understanding. Finally, these experts are integrated into an LLM backbone and refined with LoRA (Low-Rank Adaptation), enabling parallel processing at both the expert and modality levels for better scalability and efficiency.
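To make the staging concrete, here is a minimal PyTorch sketch of the two trainable pieces described above; the class names, layer shapes, and hyperparameters (`ModalityConnector`, `LoRALinear`, the rank and scaling values) are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class ModalityConnector(nn.Module):
    """Stage 1 (illustrative): project one modality's encoder features
    into the LLM's hidden/token-embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, encoder_dim) -> (batch, seq, llm_dim)
        return self.proj(feats)

class LoRALinear(nn.Module):
    """Stage 3 (illustrative): add a trainable low-rank update
    (alpha / r) * B(A(x)) to a frozen backbone linear layer."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # backbone stays frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: 1024-dim audio features mapped into a 4096-dim LLM space.
connector = ModalityConnector(encoder_dim=1024, llm_dim=4096)
tokens = connector(torch.randn(2, 50, 1024))   # -> (2, 50, 4096)
```

Per the three-stage description above, the connectors are trained first, and the LoRA update is only added when the experts are merged into the backbone in the final stage.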

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment with Python 3.9.16, then install dependencies with pip install -r env.txt.
  • Prerequisites: CUDA >= 11.7.
  • Weights: Download the specified vision, speech, audio, and Uni-MoE checkpoints and organize them as directed.
  • Configuration: Replace all absolute-path placeholders with actual file paths.
  • Resources: 80 GB of GPU RAM is recommended for experiments (see the environment-check sketch after this list).
  • Links: Project Page, Demo Video, Paper, Hugging Face.
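The sketch below is an assumed helper, not part of the repository: it uses PyTorch to verify the prerequisites listed above (a CUDA build of at least 11.7 and roughly 80 GB of GPU memory) before launching an experiment.

```python
import torch

def check_environment(min_cuda=(11, 7), recommended_gpu_gb=80):
    """Illustrative helper (not part of the repo): verify the
    prerequisites listed above before running experiments."""
    assert torch.cuda.is_available(), "No CUDA device visible"
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    assert (major, minor) >= min_cuda, f"CUDA {major}.{minor} < 11.7"
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < recommended_gpu_gb:
        print(f"Warning: {total_gb:.0f} GB GPU RAM; 80 GB is recommended")

check_environment()
```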

Highlighted Details

  • Supports diverse modalities: audio, speech, image, text, and video.
  • Scalable architecture using Mixture-of-Experts.
  • Three-stage training pipeline for robust multimodal understanding.
  • Released Uni-MoE-v2 with 8 experts and improved multi-node/multi-GPU training scripts (a routing sketch follows this list).
  • Introduced VideoVista benchmark and VideoVista-Train dataset.
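For intuition about the 8-expert layer mentioned above, here is a hedged sparse-routing sketch in PyTorch. Top-2 dispatch is a common MoE choice assumed here for illustration; the real router, expert shapes, and load-balancing logic of Uni-MoE-v2 are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Illustrative sparse MoE layer: each token is routed to its
    top-2 of 8 feed-forward experts and their outputs are mixed."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)             # (tokens, experts)
        weights, idx = gate.topk(self.top_k, dim=-1)         # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens sent to e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = Top2MoE(dim=512)
y = layer(torch.randn(16, 512))   # 16 tokens routed across 8 experts
```

Because only the selected experts run for each token, compute grows far more slowly than parameter count, which is the scalability argument behind the MoE design.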

Maintenance & Community

  • Paper accepted by IEEE TPAMI (2025).
  • Active development with recent releases of v2 checkpoints and training scripts.
  • Links to project page and demo available.

Licensing & Compatibility

  • Data and checkpoints are intended and licensed for research use only.
  • Restrictions follow LLaMA and Vicuna license agreements; commercial use is prohibited.

Limitations & Caveats

The project's data and checkpoints are strictly limited to research purposes and cannot be used commercially due to licensing restrictions inherited from LLaMA and Vicuna.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 31 stars in the last 90 days
