MM-Interleaved by OpenGVLab

End-to-end generative AI for interleaved image-text data

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

MM-Interleaved is an end-to-end generative model designed for interleaved image-text data. It addresses the challenge of jointly modeling visual and textual information in a sequential, auto-regressive manner, enabling the generation of both accurate textual descriptions and visually coherent images. The project targets researchers and practitioners in multi-modal AI, offering a powerful foundation for tasks requiring integrated image and text understanding and generation.

How It Works

MM-Interleaved introduces a multi-modal feature synchronizer (MMFS) that captures fine-grained, multi-scale, high-resolution features across multiple images. This allows the model to condition on complex visual contexts when generating text, and vice versa, with high fidelity. The auto-regressive generation process keeps textual descriptions and visual outputs consistent and contextually relevant, making the model suitable for tasks such as visual storytelling and interleaved image-text generation.
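To picture the interleaved auto-regressive setup, one can think of the model emitting a single flat stream in which image-token spans are interspersed with text tokens, and the spans are later routed to an image decoder. The sketch below is a toy illustration only, not MM-Interleaved's actual API: the marker tokens `BOI`/`EOI` and the function name are assumptions for the example.

```python
# Toy sketch of splitting an interleaved image-text stream back into
# segments after autoregressive decoding. NOT the project's real API:
# the marker tokens and function name are illustrative assumptions.

BOI, EOI = "<image>", "</image>"  # hypothetical begin/end-of-image markers

def decode_interleaved(tokens):
    """Split a flat token stream into ("text", str) and ("image", [tokens]) segments."""
    segments, buf = [], []
    for tok in tokens:
        if tok == BOI:
            if buf:  # flush any pending text before the image span starts
                segments.append(("text", " ".join(buf)))
                buf = []
        elif tok == EOI:
            # In the real model these would be latent tokens fed to an image decoder.
            segments.append(("image", buf))
            buf = []
        else:
            buf.append(tok)
    if buf:  # flush trailing text
        segments.append(("text", " ".join(buf)))
    return segments

stream = ["A", "cat", BOI, "z1", "z2", EOI, "sits", "here"]
print(decode_interleaved(stream))
# → [('text', 'A cat'), ('image', ['z1', 'z2']), ('text', 'sits here')]
```

Because the stream is generated left to right, each image span is conditioned on all preceding text and images, which is what makes the outputs mutually consistent.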

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies via pip install -r requirements.txt, and compile the MultiScaleDeformableAttention module (cd mm_interleaved/models/utils/ops && python setup.py install).
  • Pretrained Models: Download model components from Hugging Face using python mm_interleaved/scripts/download_hf_models.py into the assets/ directory.
  • Inference: Run python -u inference.py --config_file=mm_interleaved/configs/release/mm_inference.yaml.
  • Prerequisites: A Python environment, a CUDA-capable GPU (the deformable-attention op is compiled from source), and the downloaded pretrained model weights.
  • Links: Repository: https://github.com/OpenGVLab/MM-Interleaved
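The steps above can be run end to end as follows. This recipe is assembled from the README commands listed above; it assumes the repository root as the working directory and a CUDA toolchain for the op compilation.

```shell
git clone https://github.com/OpenGVLab/MM-Interleaved.git
cd MM-Interleaved
pip install -r requirements.txt

# Compile the MultiScaleDeformableAttention module
cd mm_interleaved/models/utils/ops && python setup.py install && cd -

# Download pretrained model components into assets/
python mm_interleaved/scripts/download_hf_models.py

# Run inference with the release config
python -u inference.py --config_file=mm_interleaved/configs/release/mm_inference.yaml
```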

Highlighted Details

  • Achieves superior zero-shot performance on various multi-modal comprehension and generation benchmarks.
  • Supports fine-tuning for diverse downstream tasks including visual question answering, image captioning, text-to-image generation, and visual storytelling.
  • Pretrained on a mixture of publicly available datasets.
  • Natively supports interleaved image and text generation with flexible input formats.

Maintenance & Community

The project has released inference, evaluation, and pre-training code, along with pre-trained model weights. Finetuning code is planned for future release. No specific community channels (e.g., Discord, Slack) or prominent maintainer/sponsor information is detailed in the README.

Licensing & Compatibility

The project is released under the Apache 2.0 license. However, pretrained model weights are subject to Llama's model license, which may impose restrictions on commercial use or redistribution. Users must consult both licenses for compatibility.

Limitations & Caveats

Finetuning code is not yet publicly available. The use of pretrained weights is constrained by Llama's license. Evaluation is configured by default for Slurm distributed environments, potentially requiring adaptation for single-machine setups. The README implies a need for specific data preparation and directory structures for evaluation and pre-training.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
