InternVL by OpenGVLab

Open-source MLLM alternative to GPT-4o

created 1 year ago
8,682 stars

Top 6.0% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

The InternVL family provides a suite of open-source multimodal large language models (MLLMs) designed to rival the performance of commercial models like GPT-4o. Targeting researchers and developers, it offers a scalable solution for tasks requiring visual understanding and language generation, with recent versions achieving state-of-the-art results on various benchmarks.

How It Works

InternVL models integrate powerful vision encoders (like InternViT) with large language models (LLMs) such as Qwen or Llama. The architecture supports various configurations, from smaller, efficient models to large-scale versions, enabling flexibility in deployment and performance. Key innovations include advanced training techniques like Mixed Preference Optimization (MPO) and novel positional encodings, contributing to improved reasoning and perception capabilities across diverse multimodal tasks.
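The pipeline described above can be sketched as a toy program: image patches are encoded into visual features, linearly projected into the LLM's embedding space, and prepended to the text token embeddings. This is a pure-Python illustration of the general vision-encoder-plus-projector pattern; all names, dimensions, and weights here are made up for clarity and are not InternVL's actual API.

```python
import random

random.seed(0)

VIT_DIM = 8    # toy vision feature size (InternViT uses thousands of dims)
LLM_DIM = 12   # toy LLM embedding size

def encode_image(num_patches):
    """Stand-in for a vision encoder (e.g. InternViT): one feature per patch."""
    return [[random.random() for _ in range(VIT_DIM)] for _ in range(num_patches)]

def project(features, weights):
    """Linear projector mapping vision features into the LLM embedding space."""
    return [[sum(f[i] * weights[i][j] for i in range(VIT_DIM))
             for j in range(LLM_DIM)] for f in features]

def build_llm_input(visual_tokens, text_tokens):
    """The LLM consumes projected visual tokens followed by text embeddings."""
    return visual_tokens + text_tokens

# Toy projector weights, a 4-patch "image", and 3 text-token embeddings.
W = [[random.random() for _ in range(LLM_DIM)] for _ in range(VIT_DIM)]
visual = project(encode_image(4), W)
text = [[0.0] * LLM_DIM for _ in range(3)]
seq = build_llm_input(visual, text)
print(len(seq), len(seq[0]))  # 7 tokens, each of size LLM_DIM
```

The design choice this illustrates is why the family is modular: swapping the LLM (Qwen, Llama) or scaling the vision encoder only changes the two ends of this pipeline, while the projector adapts one embedding space to the other.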

Quick Start & Requirements

  • Installation: Primarily via the Hugging Face transformers library.
  • Prerequisites: Python and PyTorch. A GPU with sufficient VRAM is highly recommended; exact requirements vary by model size.
  • Resources: Models range from 1B to 78B parameters. Larger models require significant GPU memory (e.g., 78B models may need multiple high-VRAM GPUs).
  • Documentation: Comprehensive guides are available at https://internvl.readthedocs.io/en/latest/.

Highlighted Details

  • InternVL2.5-78B achieves over 70% on the MMMU benchmark, matching leading closed-source models.
  • Mini-InternVL models offer high performance with significantly reduced parameter counts.
  • Supports multilingual capabilities and advanced features like 4K image processing and strong OCR.
  • Includes dedicated models for CLIP-like cross-modal retrieval (InternVL-C) and generative tasks (InternVL-G).
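The 4K image support mentioned above relies on dynamic high-resolution tiling: a large image is split into a grid of 448-pixel tiles before encoding. The sketch below shows only the grid-selection step in simplified form; `tile_grid` is an illustrative name, and the real preprocessing differs in tie-breaking details and also appends a global thumbnail tile.

```python
TILE = 448  # InternVL encodes images as 448x448-pixel tiles

def tile_grid(width, height, max_tiles=12):
    """Pick a (cols, rows) grid whose aspect ratio best matches the image,
    subject to cols * rows <= max_tiles; each cell is then resized to
    TILE x TILE. Simplified from InternVL's dynamic preprocessing."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

cols, rows = tile_grid(3840, 2160)  # a 4K, 16:9 frame
print(cols, rows, cols * TILE, rows * TILE)
```

Because the tile count is capped, the token budget stays bounded regardless of input resolution, which is what makes 4K inputs and dense OCR practical.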

Maintenance & Community

The project is actively developed by OpenGVLab, with frequent updates and new model releases. Community engagement is encouraged via WeChat groups.

Licensing & Compatibility

Released under the MIT license, allowing for broad use, including commercial applications. Some components may inherit licenses from their original sources.

Limitations & Caveats

While highly performant, larger models require substantial computational resources. Video and PDF input support in the online demo is listed as a future development item.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 18

Star History

779 stars in the last 90 days
