Multimodal model for long-context video/audio interactions, image understanding, and composition
InternLM-XComposer2.5 is a versatile large vision-language model for advanced text-image comprehension and composition. It targets researchers and developers working with multimodal AI, offering capabilities for understanding long-context inputs, high-resolution images, and streaming video/audio. The system achieves GPT-4V-level performance with a 7B LLM backend and outperforms many open-source models on 28 benchmarks.
How It Works
InternLM-XComposer2.5 utilizes a 7B LLM backend and a native 560x560 ViT vision encoder, enabling it to process high-resolution images with any aspect ratio. It handles 24K interleaved image-text contexts and can extend to 96K via RoPE extrapolation. For video, it treats frames as a high-resolution composite picture, allowing for fine-grained understanding through dense sampling. The model supports multi-turn, multi-image dialogue and can generate webpages from instructions or screenshots.
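To make the frames-as-composite-picture idea concrete, here is a minimal sketch (my own illustration, not code from the repository) that densely samples frames from a video and tiles them into one high-resolution grid image; the frame count, tile size, and grid layout are assumptions, and the file path is hypothetical.
```python
# Illustrative sketch only: tile densely sampled video frames into a single
# high-resolution composite image, approximating how the model is described
# as treating video input. num_frames, tile size, and cols are assumptions.
import cv2  # pip install opencv-python
from PIL import Image

def video_to_composite(path, num_frames=16, tile=560, cols=4):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced (dense) frame indices across the whole clip.
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    tiles = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        tiles.append(Image.fromarray(frame).resize((tile, tile)))
    cap.release()
    rows = (len(tiles) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    for i, t in enumerate(tiles):
        canvas.paste(t, ((i % cols) * tile, (i // cols) * tile))
    return canvas  # one big picture the vision encoder can consume patch-wise

composite = video_to_composite("clip.mp4")  # hypothetical input path
composite.save("composite.png")
```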
Quick Start & Requirements
`pip install internlm-xcomposer` (or use the `transformers` library).
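A minimal loading sketch via `transformers` is shown below. The Hub model ID and dtype are assumptions based on common naming; the multimodal chat/generation interface itself is supplied by the repository's remote code, so check the model card for the exact calls.
```python
# Minimal loading sketch (assumed Hub ID 'internlm/internlm-xcomposer2d5-7b';
# verify against the model card). flash-attention2 is needed for
# high-resolution usage (typically installed via `pip install flash-attn`).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "internlm/internlm-xcomposer2d5-7b"
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # pulls in the repo's custom multimodal code
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Multi-turn, multi-image chat and webpage generation are exposed through the
# remote code's own methods; see the model card for their signatures.
```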
`flash-attention2` is required for high-resolution usage.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
`flash-attention2` is a requirement for high-resolution usage, which may add complexity to setup.