InternLM-XComposer by InternLM

Multimodal model for long-context video/audio interactions, image understanding, and composition

Created 2 years ago
2,894 stars

Top 16.4% on SourcePulse

View on GitHub
Project Summary

InternLM-XComposer2.5 is a versatile large vision-language model for advanced text-image comprehension and composition. It targets researchers and developers working with multimodal AI, offering support for long-context inputs, high-resolution images, and streaming video/audio. With only a 7B LLM backend it reaches GPT-4V-level performance, outperforming many open-source models across 28 benchmarks.

How It Works

InternLM-XComposer2.5 pairs a 7B LLM backend with a native 560x560 ViT vision encoder, letting it process high-resolution images at any aspect ratio. It handles interleaved image-text contexts of up to 24K tokens and can extend to 96K via RoPE extrapolation. For video, it treats densely sampled frames as a single high-resolution composite picture (sketched below), enabling fine-grained temporal understanding. The model supports multi-turn, multi-image dialogue and can generate webpages from instructions or screenshots.
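
The composite-picture idea can be illustrated with a short sketch. This is a conceptual illustration only: frames_to_composite is a hypothetical helper, and the repo's actual sampling and tiling strategy lives in its own preprocessing code.

```python
from PIL import Image

def frames_to_composite(frames: list[Image.Image], cols: int = 4) -> Image.Image:
    """Tile sampled video frames into one large image so an image model
    can reason over the whole clip at once (illustrative grid layout)."""
    w, h = frames[0].size
    rows = (len(frames) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        canvas.paste(frame, ((i % cols) * w, (i // cols) * h))
    return canvas

# e.g. 16 frames sampled from a clip become one 4x4 composite "picture"
```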

Quick Start & Requirements

  • Install: pip install internlm-xcomposer (or use the transformers library).
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (recommended for GPU). flash-attention2 is required for high-resolution usage.
  • Demo: Examples for video understanding, multi-image dialogue, high-resolution image analysis, and webpage generation are provided.
  • Inference: Supports Hugging Face Transformers and ModelScope Swift. LMDeploy is recommended for acceleration, with 4-bit quantization available; see the Transformers sketch after this list.
  • Docs: README.md, README_CN.md, Technical Report
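
A minimal inference sketch, adapted from the project's published Hugging Face Transformers usage. The model.chat arguments are defined by the model's remote code and may change between releases, and the image path is a placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Load model and tokenizer; trust_remote_code pulls in the model's chat API.
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-7b',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detailed manner'
image = ['./examples/dubai.png']  # placeholder path
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image,
                             do_sample=False, num_beams=3, use_meta=True)
print(response)
```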

Highlighted Details

  • Achieves GPT-4V/Gemini Pro level performance on 16 key tasks.
  • Supports up to 96K context length via RoPE extrapolation (see the sketch after this list).
  • Native 560x560 ViT for ultra-high-resolution image understanding.
  • Fine-grained video understanding by treating videos as high-resolution composite pictures.
  • Capable of generating webpages from instructions or screenshots.
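
The exact RoPE extrapolation rule isn't stated here, so the sketch below shows a generic NTK-aware base adjustment, a common way to stretch rotary embeddings beyond the trained window; rope_frequencies is a hypothetical helper:

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    # NTK-aware scaling: raising the base stretches RoPE wavelengths so
    # positions past the trained window still map to familiar angles.
    base = base * scale ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

inv_freq_24k = rope_frequencies(dim=128)             # trained 24K window
inv_freq_96k = rope_frequencies(dim=128, scale=4.0)  # 4x extrapolated to 96K
```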

Maintenance & Community

  • Active development with recent releases of XComposer2.5-Reward and XComposer2.5-OmniLive.
  • Support for ModelScope Swift and LMDeploy for finetuning and inference (see the LMDeploy sketch after this list).
  • Community channels: Discord, WeChat.
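
For accelerated inference, a minimal LMDeploy sketch; the 4-bit model id ('internlm/internlm-xcomposer2d5-7b-4bit') and the example URL are assumptions based on the project's stated LMDeploy support:

```python
# Assumed 4-bit AWQ weights; swap in the model id you actually use.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('internlm/internlm-xcomposer2d5-7b-4bit',
                backend_config=TurbomindEngineConfig(model_format='awq'))

image = load_image('https://example.com/street.jpg')  # placeholder URL
print(pipe(('Describe this image.', image)).text)
```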

Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • Model weights are open for academic research; free commercial use is available via a license application.

Limitations & Caveats

  • flash-attention2 is a requirement for high-resolution usage, which may add complexity to setup.
  • Headline performance claims are broad; consult the technical report for the specific benchmarks and comparisons behind them.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 14 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

  • Top 0.1% on SourcePulse
  • 7k stars
  • Multimodal autoregressive model for long-context video/text
  • Created 1 year ago, updated 11 months ago