Vision-language model for long-context multimodal learning
Top 43.2% on sourcepulse
The Eagle family of models targets long-context multimodal learning: understanding extended video sequences and high-resolution images. Aimed at researchers and developers in computer vision and natural language processing, Eagle 2.5 provides a generalist framework that significantly improves performance on long-context benchmarks, rivaling larger commercial models while using fewer parameters.
How It Works
Eagle 2.5 employs Automatic Degrade Sampling (ADS) and Image Area Preservation (IAP) to maintain contextual integrity and visual detail during long-context training. ADS dynamically balances visual and textual inputs, while IAP optimizes image tiling to retain original aspect ratios and fine-grained details. The training pipeline also utilizes progressive mixed post-training to gradually increase context length, improving information density. A key component is the Eagle-Video-110K dataset, curated for long video understanding with story-level and clip-level annotations.
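A rough sketch of the aspect-ratio-preserving idea behind IAP (the function name, tile budget, and selection rule below are illustrative assumptions, not the released Eagle 2.5 code): pick a tile grid whose aspect ratio is closest to the source image's, so tiles are cut from a resize that keeps the original proportions instead of forcing a square layout.

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 12):
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the
    image's, so tiling keeps the original aspect ratio rather than squashing
    the image into a fixed square layout."""
    image_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            diff = abs(cols / rows - image_ratio)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best


if __name__ == "__main__":
    # A 16:9 video frame maps to a wide grid rather than a square one.
    print(choose_tile_grid(1920, 1080))  # -> (2, 1)
```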
Quick Start & Requirements
pip install transformers==4.37.2 flash-attn
The reference checkpoint is nvidia/Eagle2-1B on the Hugging Face Hub.
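A minimal loading sketch, assuming the standard Hugging Face remote-code path; the exact snippet and arguments in the README or model card may differ.

```python
# Minimal loading sketch -- assumed transformers usage; consult the model
# card for the exact, supported snippet.
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "nvidia/Eagle2-1B",
    trust_remote_code=True,       # the checkpoint ships custom modeling code
    torch_dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True)
```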
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model weights are restricted to non-commercial use. The README mentions TODO items for vLLM inference support and AWQ quantization weights, indicating these features are not yet available.