Multimodal MoE model for video, document understanding, and dialog
Aria is an open-source multimodal native Mixture-of-Experts (MoE) model designed for advanced language and vision tasks, particularly excelling in video and document understanding. It targets researchers and developers seeking state-of-the-art performance with a long context window and efficient inference.
How It Works
Aria employs a Mixture-of-Experts (MoE) architecture that activates 3.9B parameters per token, keeping inference fast and fine-tuning cost-effective. It supports a 64K-token multimodal context window, enabling comprehensive understanding of long video and document inputs. The model is integrated with Hugging Face Transformers for ease of use and is also compatible with vLLM for higher-throughput inference.
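As an illustration of the Transformers integration, here is a minimal inference sketch. The Hub id rhymes-ai/Aria, the trust_remote_code flag, and the chat-template call are assumptions; consult the official model card for the exact supported snippet.

```python
# Hedged sketch of inference with Hugging Face Transformers.
# Hub id, trust_remote_code usage, and chat-template call are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hub id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# One image plus a text question, formatted through the processor's chat template.
image = Image.open("report_page.png")  # placeholder input
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize this page."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```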
Quick Start & Requirements
Install from source with pip install -e . (or pip install -e .[dev] for development). Additional dependencies: pip install grouped_gemm and pip install flash-attn --no-build-isolation.
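Since vLLM compatibility is noted above, the following is a hedged sketch of batched text generation through vLLM. The model id, dtype, and trust_remote_code flag are assumptions; multimodal inputs would follow vLLM's own multi-modal input format.

```python
# Hedged sketch of serving Aria with vLLM; model id and flags are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="rhymes-ai/Aria",   # assumed Hub id
    dtype="bfloat16",
    trust_remote_code=True,   # Aria ships custom modeling code
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain Mixture-of-Experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```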
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Fine-tuning requires a specific version of the transformers library (v4.45.0) and a particular model revision ("4844f0b5ff678e768236889df5accbe4967ec845") due to weight mapping changes. Memory requirements for fine-tuning vary significantly with dataset type.
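To make the pinning concrete, here is a hedged sketch of loading the specific revision before fine-tuning. The Hub id and trust_remote_code flag are assumptions; the library version and revision hash are the ones quoted above.

```python
# Hedged sketch: pin the transformers version and model revision noted above.
#   pip install transformers==4.45.0
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",  # assumed Hub id
    revision="4844f0b5ff678e768236889df5accbe4967ec845",  # revision cited in the caveat
    trust_remote_code=True,
)
```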