Research paper on scaling unified multimodal LLMs with MoE
Uni-MoE is a Mixture-of-Experts (MoE) based unified multimodal large language model (MLLM) capable of processing audio, speech, image, text, and video. It targets researchers and developers working on multimodal AI, offering a scalable architecture for handling diverse data types within a single model.
How It Works
The model employs a three-stage training process. First, it builds modality connectors to map diverse inputs into a unified language space. Second, modality-specific experts are trained using cross-modal data for deep understanding. Finally, these experts are integrated into an LLM backbone and refined using LoRA, enabling parallel processing at both expert and modality levels for enhanced scalability and efficiency.
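The expert-plus-LoRA idea can be pictured with a small PyTorch sketch. This is illustrative only: the expert count, the soft top-k routing, and where the LoRA adapters sit are assumptions made for exposition, not Uni-MoE's actual implementation.

```python
# Sketch of an MoE block whose experts carry LoRA adapters: the backbone
# projections stay frozen while low-rank updates are trained, mirroring the
# "integrate experts, refine with LoRA" stage described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank (LoRA) update."""

    def __init__(self, dim_in, dim_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out, bias=False)
        self.base.weight.requires_grad_(False)              # backbone weight stays frozen
        self.lora_a = nn.Linear(dim_in, rank, bias=False)    # down-projection
        self.lora_b = nn.Linear(rank, dim_out, bias=False)   # up-projection
        nn.init.zeros_(self.lora_b.weight)                   # adapter is a no-op at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class MoELayer(nn.Module):
    """Soft top-k routing over a small pool of LoRA-adapted experts."""

    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(LoRALinear(dim, 4 * dim), nn.GELU(), LoRALinear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, seq, dim)
        gates = F.softmax(self.router(x), dim=-1)            # (B, S, num_experts)
        topv, topi = gates.topk(self.top_k, dim=-1)          # keep top-k experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)         # renormalise kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Dense evaluation for clarity; real MoE kernels dispatch only routed tokens.
            weight = torch.where(topi == e, topv, torch.zeros_like(topv)).sum(-1, keepdim=True)
            out = out + weight * expert(x)
        return out


if __name__ == "__main__":
    layer = MoELayer(dim=64)
    tokens = torch.randn(2, 10, 64)   # stand-in for tokens already mapped to the language space
    print(layer(tokens).shape)        # torch.Size([2, 10, 64])
```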
Quick Start & Requirements
Create a conda environment with Python 3.9.16 and install dependencies via pip install -r env.txt.
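After setup, inference bundles a text prompt with optional media inputs. The stub below sketches the shape of such a call under assumed names; it is not the repository's actual API, and the repo's own inference and demo scripts are the authoritative entry points.

```python
# Hypothetical usage sketch: class and method names are assumptions for
# illustration only, not the repository's real interface.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalQuery:
    """One request mixing text with optional image/audio/video inputs."""
    text: str
    image_path: Optional[str] = None
    audio_path: Optional[str] = None
    video_path: Optional[str] = None


class StubUniMoE:
    """Placeholder standing in for a loaded Uni-MoE checkpoint."""

    def generate(self, query: MultimodalQuery) -> str:
        # In the real model, each non-text input is encoded by its modality
        # connector, projected into the shared language space, and interleaved
        # with the text tokens before decoding.
        attached = [f for f in ("image_path", "audio_path", "video_path")
                    if getattr(query, f) is not None]
        return f"(answer conditioned on text + {attached})"


if __name__ == "__main__":
    model = StubUniMoE()
    print(model.generate(MultimodalQuery(
        text="What instrument is playing in the clip?",
        audio_path="examples/clip.wav",
    )))
```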
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's data and checkpoints are strictly limited to research purposes and cannot be used commercially due to licensing restrictions inherited from LLaMA and Vicuna.