This repository provides a suite of video foundation models and datasets designed for multimodal understanding and generation. Targeting researchers and developers in computer vision and AI, it offers scalable models and large-scale datasets to advance video-centric AI capabilities.
How It Works
The InternVideo series combines generative learning (masked video modeling) with discriminative learning (video-text contrastive alignment) to build comprehensive video understanding models. InternVideo2 scales this recipe for multimodal tasks, while InternVideo2.5 strengthens long-context modeling for longer, richer video content. The project also includes InternVid, a large-scale video-text dataset that supports both understanding and generation tasks.
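To make the dual objective concrete, the sketch below (all names hypothetical, written in PyTorch) pairs a masked-reconstruction loss with a symmetric video-text InfoNCE loss. It illustrates the general recipe, not the repository's actual training code.

```python
import torch
import torch.nn.functional as F

def dual_pretraining_loss(video_encoder, text_encoder, decoder,
                          video, masked_video, mask, text_tokens,
                          temperature=0.07, alpha=0.5):
    """Hypothetical sketch: generative + discriminative pretraining losses."""
    # Generative branch: reconstruct the masked-out video patches.
    latent = video_encoder(masked_video)
    recon = decoder(latent)
    gen_loss = F.mse_loss(recon[mask], video[mask])

    # Discriminative branch: symmetric video-text contrastive (InfoNCE)
    # loss over pooled, L2-normalized embeddings.
    v = F.normalize(video_encoder(video).mean(dim=1), dim=-1)  # (B, D)
    t = F.normalize(text_encoder(text_tokens), dim=-1)         # (B, D)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    con_loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    # Weighted sum; alpha balances reconstruction against alignment.
    return alpha * gen_loss + (1 - alpha) * con_loss
```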
Quick Start & Requirements
- Installation and usage details are available in the official documentation; a minimal model-loading sketch follows this list.
- Requires Python and a PyTorch-based deep learning stack. GPU memory requirements grow with model size: the 8B checkpoints need substantially more than the distilled S/B/L variants.
- Links: Official Documentation, HuggingFace Models
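As a rough illustration of usage, published checkpoints can typically be loaded from the HuggingFace hub via transformers' remote-code path. A minimal sketch, assuming the checkpoint ships transformers-compatible code; verify the exact repo id on the hub:

```python
from transformers import AutoModel, AutoTokenizer

# Repo id is an assumption: pick the checkpoint you need from the
# OpenGVLab organization on HuggingFace.
repo_id = "OpenGVLab/InternVideo2-Chat-8B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,  # such checkpoints ship custom modeling code
    torch_dtype="auto",      # defer to the checkpoint's fp16/bf16 dtype
).eval()
```

Note that an 8B model in half precision needs roughly 16 GB of GPU memory for the weights alone, so the distilled S/B/L variants are the safer choice on smaller cards.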
Highlighted Details
- Offers a range of model sizes, from smaller distilled variants (InternVideo2-S/B/L) up to 8B-parameter models.
- Includes InternVid, a large-scale dataset of 230 million video-text pairs; a loading sketch follows this list.
- Supports video instruction tuning for multimodal dialogue systems like VideoChat.
- Models and datasets are available on HuggingFace.
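The InternVid annotations can be streamed with the datasets library, so the full corpus need not be downloaded up front. A minimal sketch; the repo id and subset name below are assumptions to check against the dataset card:

```python
from datasets import load_dataset

# Repo id and subset are assumptions; consult the InternVid dataset
# card on HuggingFace for the exact configuration names.
ds = load_dataset("OpenGVLab/InternVid", "InternVid-10M-FLT",
                  split="train", streaming=True)

# Streaming yields one annotation at a time: typically a clip/video id,
# start and end timestamps, and the associated caption.
row = next(iter(ds))
print(row)
```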
Maintenance & Community
- Actively updated with new releases like InternVideo2.5.
- Community discussion via WeChat groups.
- The team is hiring researchers and engineers working on video foundation models.
Licensing & Compatibility
- The specific license is not explicitly stated in the provided README snippet. Users should verify licensing terms for commercial use or integration into closed-source projects.
Limitations & Caveats
- Licensing is not documented in the summarized README (see above), which may slow commercial adoption.
- Hardware requirements for the larger models are not detailed.