UMOE-Scaling-Unified-Multimodal-LLMs by HITsz-TMG

Research paper on scaling unified multimodal LLMs with MoE

Created 1 year ago · 761 stars · Top 45.7% on SourcePulse

View on GitHub
Project Summary

Uni-MoE is a Mixture-of-Experts (MoE) based unified multimodal large language model (MLLM) capable of processing audio, speech, image, text, and video. It targets researchers and developers working on multimodal AI, offering a scalable architecture for handling diverse data types within a single model.

How It Works

The model employs a three-stage training process. First, it builds modality connectors to map diverse inputs into a unified language space. Second, modality-specific experts are trained using cross-modal data for deep understanding. Finally, these experts are integrated into an LLM backbone and refined using LoRA, enabling parallel processing at both expert and modality levels for enhanced scalability and efficiency.
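The PyTorch-style sketch below illustrates the two building blocks described above: a modality connector (stage one) that projects encoder features into the LLM's token space, and a sparse MoE layer (stages two and three) that routes each token to a small number of experts. Class names, dimensions, and the top-k routing details are illustrative assumptions rather than the repository's actual implementation, and the LoRA refinement step is omitted.

```python
import torch
import torch.nn as nn


class ModalityConnector(nn.Module):
    """Stage 1 (sketch): project a modality encoder's features into the LLM token space."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq, encoder_dim) -> (batch, seq, llm_dim)
        return self.proj(features)


class SparseMoELayer(nn.Module):
    """Stages 2-3 (sketch): route each token to its top-k modality-specific experts."""

    def __init__(self, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(llm_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(llm_dim, 4 * llm_dim),
                nn.GELU(),
                nn.Linear(4 * llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, llm_dim)
        scores = self.router(hidden).softmax(dim=-1)      # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(hidden)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(hidden[mask])
        return out
```

In the full model, connector outputs for audio, speech, image, and video are fed alongside text embeddings into the MoE-augmented LLM layers, and the experts are refined with LoRA; Uni-MoE-v2 scales the expert count to 8.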

Quick Start & Requirements

  • Installation: Clone the repository, activate a conda environment with Python 3.9.16, and install dependencies via pip install -r env.txt (see the consolidated snippet after this list).
  • Prerequisites: CUDA version >= 11.7.
  • Weights: Download and organize specified checkpoints for vision, speech, audio, and Uni-MoE models.
  • Configuration: Replace all absolute path placeholders with actual file paths.
  • Resources: 80 GB of GPU memory is recommended for experiments.
  • Links: Project Page, Demo Video, Paper, Hugging Face.
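The following shell snippet consolidates the setup steps above. The repository URL and the environment name `unimoe` are assumptions for illustration; checkpoint download links and the exact config paths to edit are listed in the repository's README.

```bash
# Requires an NVIDIA GPU with CUDA >= 11.7; 80 GB of GPU memory is recommended.
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs

# Python 3.9.16 environment (the name "unimoe" is arbitrary)
conda create -n unimoe python=3.9.16 -y
conda activate unimoe
pip install -r env.txt

# Next: download the vision, speech, audio, and Uni-MoE checkpoints,
# then replace the absolute-path placeholders in the configs with local paths.
```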

Highlighted Details

  • Supports diverse modalities: audio, speech, image, text, and video.
  • Scalable architecture using Mixture-of-Experts.
  • Three-stage training pipeline for robust multimodal understanding.
  • Released Uni-MoE-v2 with 8 experts and enhanced multi-node/GPU training scripts.
  • Introduced VideoVista benchmark and VideoVista-Train dataset.

Maintenance & Community

  • Paper accepted by IEEE TPAMI (2025).
  • Active development with recent releases of v2 checkpoints and training scripts.
  • Links to project page and demo available.

Licensing & Compatibility

  • Data and checkpoints are intended and licensed for research use only.
  • Restrictions follow LLaMA and Vicuna license agreements; commercial use is prohibited.

Limitations & Caveats

The project's data and checkpoints are strictly limited to research purposes and cannot be used commercially due to licensing restrictions inherited from LLaMA and Vicuna.

Health Check

  • Last Commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

  • Top 0.1% on SourcePulse · 4k stars
  • Any-to-any multimodal LLM research paper
  • Created 2 years ago · Updated 4 months ago
  • Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Zack Li (cofounder of Nexa AI), and 19 more.

LLaVA by haotian-liu

  • Top 0.2% on SourcePulse · 24k stars
  • Multimodal assistant with GPT-4 level capabilities
  • Created 2 years ago · Updated 1 year ago