Aria by rhymes-ai

Multimodal MoE model for video, document understanding, and dialog

created 10 months ago
1,060 stars

Top 36.3% on sourcepulse

Project Summary

Aria is an open-source multimodal native Mixture-of-Experts (MoE) model designed for advanced language and vision tasks, particularly excelling in video and document understanding. It targets researchers and developers seeking state-of-the-art performance with a long context window and efficient inference.

How It Works

Aria employs a Mixture-of-Experts (MoE) architecture, activating 3.9B parameters per token for fast inference and cost-effective fine-tuning. It supports a 64K token multimodal context window, enabling comprehensive understanding of extended inputs. The model is integrated with Hugging Face Transformers for ease of use and offers compatibility with vLLM for enhanced performance.
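The vLLM compatibility mentioned above can be sketched roughly as follows. This is a minimal, unverified sketch: the model id "rhymes-ai/Aria" and the specific engine arguments are assumptions based on common vLLM usage, not flags confirmed by the repository; check the README for the exact invocation it recommends.

```python
# Hypothetical vLLM serving sketch for Aria (assumed model id "rhymes-ai/Aria").
# Engine arguments below are illustrative, not taken from the repository.

def engine_kwargs(max_ctx: int = 65536) -> dict:
    """Engine arguments matching the 64K-token context window described above."""
    return {
        "model": "rhymes-ai/Aria",   # assumed Hugging Face model id
        "trust_remote_code": True,   # custom MoE code lives in the model repo
        "dtype": "bfloat16",         # precision noted in the requirements
        "max_model_len": max_ctx,    # 64K multimodal context window
    }

if __name__ == "__main__":
    # Heavy imports are kept behind the main guard; this path needs a large GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(temperature=0.7, max_tokens=256)
    out = llm.generate(["Describe the attached video frames."], params)
    print(out[0].outputs[0].text)
```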

Quick Start & Requirements

  • Install: pip install -e . (or pip install -e .[dev] for development). Additional installs: pip install grouped_gemm, pip install flash-attn --no-build-isolation.
  • Requirements: Python, PyTorch, Hugging Face Transformers. GPU with at least 80GB VRAM (e.g., A100) recommended for inference and fine-tuning. bfloat16 precision is used.
  • Resources: Inference requires one A100 (80GB) GPU. Fine-tuning can be done with LoRA on a single A100/H100 (80GB) or full parameter tuning on multiple A100 (80GB) GPUs.
  • Docs: Hugging Face, Paper, Blog, WebDemo.
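A minimal inference sketch following the setup above, using the Hugging Face Transformers integration. This is an assumption-laden outline, not the repository's documented example: the model id "rhymes-ai/Aria", the chat-message layout, and the image filename are all hypothetical; consult the Hugging Face model card for the exact API.

```python
# Hedged inference sketch for Aria via Hugging Face Transformers.
# The model id "rhymes-ai/Aria" and the message format are assumptions.

def build_messages(question: str) -> list:
    """Assemble a single-turn chat message with an image placeholder."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},                     # slot filled by the processor
                {"type": "text", "text": question},
            ],
        }
    ]

if __name__ == "__main__":
    # Requires an 80GB-class GPU (e.g., A100), per the requirements above.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "rhymes-ai/Aria"  # assumed model id
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16,  # bfloat16 precision, per the requirements
        trust_remote_code=True,
    )

    image = Image.open("document_page.png")  # hypothetical input
    text = processor.apply_chat_template(
        build_messages("Summarize this document."), add_generation_prompt=True
    )
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(processor.decode(output[0], skip_special_tokens=True))
```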

Highlighted Details

  • State-of-the-art performance in video and document understanding.
  • Long multimodal context window of 64K tokens.
  • 3.9B activated parameters per token for efficient inference.
  • Offers both LoRA and full parameter fine-tuning capabilities.

Maintenance & Community

  • Active development with recent releases including Aria-Chat and base models.
  • Community support via Discord.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Fine-tuning requires a specific version of the transformers library (v4.45.0) and a particular model revision ("4844f0b5ff678e768236889df5accbe4967ec845") because later weight-mapping changes break the fine-tuning code. Memory requirements for fine-tuning vary significantly with dataset type.
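The pinning described above can be sketched as follows. The library version and revision hash come from the caveat above; the model id "rhymes-ai/Aria" is an assumption, so verify it against the repository before relying on this.

```python
# Pin the library version noted above before fine-tuning:
#   pip install transformers==4.45.0
# Then load the specific model revision with the pre-remap weight layout.

ARIA_REVISION = "4844f0b5ff678e768236889df5accbe4967ec845"  # revision from the caveat above

if __name__ == "__main__":
    # Heavy import kept behind the main guard; loading needs an 80GB-class GPU.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "rhymes-ai/Aria",        # assumed model id
        revision=ARIA_REVISION,  # fixes the weight mapping the fine-tuning code expects
        trust_remote_code=True,
    )
```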

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 36 stars in the last 90 days

