InternLM-XComposer by InternLM

Multimodal model for long-context video/audio interactions, image understanding, and composition

created 1 year ago
2,877 stars

Top 16.9% on sourcepulse

View on GitHub
Project Summary

InternLM-XComposer2.5 is a versatile large vision-language model designed for advanced text-image comprehension and composition. It targets researchers and developers working with multimodal AI, offering capabilities for understanding long-context inputs, high-resolution images, and streaming video/audio. The system achieves GPT-4V level performance with a 7B LLM backend, outperforming many open-source models on 28 benchmarks.

How It Works

InternLM-XComposer2.5 utilizes a 7B LLM backend and a native 560x560 ViT vision encoder, enabling it to process high-resolution images with any aspect ratio. It handles 24K interleaved image-text contexts and can extend to 96K via RoPE extrapolation. For video, it treats frames as a high-resolution composite picture, allowing for fine-grained understanding through dense sampling. The model supports multi-turn, multi-image dialogue and can generate webpages from instructions or screenshots.
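The "frames as one high-resolution composite picture" idea can be sketched with simple grid math. This is an illustrative sketch of the concept only, not the repository's actual preprocessing code:

```python
import math

def composite_grid(num_frames: int, frame_w: int, frame_h: int):
    """Arrange video frames into a near-square grid so a whole clip can be
    fed to the vision encoder as a single high-resolution image.
    Returns (rows, cols, composite_width, composite_height)."""
    cols = math.ceil(math.sqrt(num_frames))
    rows = math.ceil(num_frames / cols)
    return rows, cols, cols * frame_w, rows * frame_h

# e.g. 16 densely sampled 560x560 frames -> one 2240x2240 composite image
print(composite_grid(16, 560, 560))  # (4, 4, 2240, 2240)
```

Because the composite keeps every frame at full resolution, fine-grained details survive into the encoder, at the cost of a quadratically growing input image as more frames are sampled.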

Quick Start & Requirements

  • Install: pip install internlm-xcomposer (or use transformers library).
  • Prerequisites: Python 3.8+, PyTorch 1.12+ (2.0+ recommended), CUDA 11.4+ (recommended for GPU). FlashAttention-2 is required for high-resolution usage.
  • Demo: Examples for video understanding, multi-image dialogue, high-resolution image analysis, and webpage generation are provided.
  • Inference: Supports Hugging Face Transformers and ModelScope Swift. LMDeploy is recommended for acceleration, with 4-bit quantization available.
  • Docs: README.md, README_CN.md, Technical Report
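A minimal Transformers loading sketch is shown below. The model ID, the `trust_remote_code=True` flag, and the dtype choice are assumptions based on common InternLM conventions; verify the exact snippet against the repository README:

```python
def load_xcomposer(model_id: str = "internlm/internlm-xcomposer2d5-7b"):
    """Load InternLM-XComposer2.5 via Hugging Face Transformers.
    Lazy imports keep this sketch importable without the heavy deps installed.
    The model ID above is an assumption -- check the repo for the real one."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half precision to fit a single GPU
        trust_remote_code=True,      # the model ships custom modeling code
    ).eval()
    return model, tokenizer
```

For production serving, the LMDeploy path mentioned above (with optional 4-bit quantization) is the recommended acceleration route.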

Highlighted Details

  • Achieves GPT-4V/Gemini Pro level performance on 16 key tasks.
  • Supports up to 96K context length via RoPE extrapolation.
  • Native 560x560 ViT for ultra-high-resolution image understanding.
  • Fine-grained video understanding by treating videos as high-resolution composite pictures.
  • Capable of generating webpages from instructions or screenshots.
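The 24K-to-96K context extension via RoPE extrapolation can be illustrated with the standard RoPE frequency computation. Scaling the rotary base (the "NTK-aware" trick) is one common extrapolation scheme; the repository's exact method may differ, so treat this as a conceptual sketch:

```python
def rope_inv_freq(dim: int, base: float = 10000.0):
    """Standard RoPE inverse frequencies for head dimension `dim`."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base: float, dim: int, scale: float):
    """NTK-aware base rescaling: one common way to stretch a RoPE model from
    its trained context (e.g. 24K) to a longer one (e.g. 96K, scale=4).
    Illustrative only -- the repo's extrapolation scheme may differ."""
    return base * scale ** (dim / (dim - 2))

inv = rope_inv_freq(128)
inv_long = rope_inv_freq(128, ntk_scaled_base(10000.0, 128, 4.0))
# With the enlarged base, low frequencies rotate more slowly, so positions
# up to 4x further apart stay within the rotation range seen in training.
print(inv_long[-1] < inv[-1])  # True
```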

Maintenance & Community

  • Active development with recent releases of XComposer2.5-Reward and XComposer2.5-OmniLive.
  • Support for ModelScope Swift and LMDeploy for finetuning and inference.
  • Community channels: Discord, WeChat.

Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • Model weights are open for academic research; free commercial use is available after completing a license application.

Limitations & Caveats

  • FlashAttention-2 is required for high-resolution usage, which may add complexity to setup.
  • Headline claims (GPT-4V level performance, strength on 28 benchmarks) should be checked against the detailed benchmark tables in the technical report.
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 68 stars in the last 90 days
