Vision-language model for long-context multimodal learning
Top 43.2% on sourcepulse
The Eagle family of models targets long-context multimodal learning: understanding extended video sequences and high-resolution images. Aimed at researchers and developers in computer vision and natural language processing, Eagle 2.5 provides a generalist framework that significantly improves performance on long-context benchmarks, rivaling larger commercial models while using fewer parameters.
How It Works
Eagle 2.5 employs Automatic Degrade Sampling (ADS) and Image Area Preservation (IAP) to maintain contextual integrity and visual detail during long-context training. ADS dynamically balances visual and textual inputs, while IAP optimizes image tiling to retain original aspect ratios and fine-grained details. The training pipeline also utilizes progressive mixed post-training to gradually increase context length, improving information density. A key component is the Eagle-Video-110K dataset, curated for long video understanding with story-level and clip-level annotations.
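A rough sketch of the aspect-ratio-preserving idea behind IAP (the function name, tile budget, and selection rule below are illustrative assumptions, not the released Eagle 2.5 code): pick a tile grid whose aspect ratio is closest to the source image's, so tiles are cut from a resize that keeps the original proportions instead of forcing a square layout.

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 12):
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the
    image's, so tiling keeps the original aspect ratio rather than squashing
    the image into a fixed square layout."""
    image_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            diff = abs(cols / rows - image_ratio)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best


if __name__ == "__main__":
    # A 16:9 video frame maps to a wide grid rather than a square one.
    print(choose_tile_grid(1920, 1080))  # -> (2, 1)
```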
Quick Start & Requirements
pip install transformers==4.37.2 flash-attn
The reference checkpoint is nvidia/Eagle2-1B on the Hugging Face Hub.
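A minimal loading sketch, assuming the standard Hugging Face remote-code path; the exact snippet and arguments in the README or model card may differ.

```python
# Minimal loading sketch -- assumed transformers usage; consult the model
# card for the exact, supported snippet.
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "nvidia/Eagle2-1B",
    trust_remote_code=True,       # the checkpoint ships custom modeling code
    torch_dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True)
```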
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model weights are restricted to non-commercial use. The README mentions TODO items for vLLM inference support and AWQ quantization weights, indicating these features are not yet available.