Otter is a multi-modal large language model (LMM) designed for instruction following and in-context learning with images and videos. It is based on the OpenFlamingo architecture and trained on the MIMIC-IT dataset, offering an open-source alternative for researchers and developers working with vision-language tasks.
How It Works
Otter leverages the Flamingo architecture, which excels at processing multiple interleaved image and text inputs. It is trained using an in-context instruction tuning methodology on the MIMIC-IT dataset, which comprises 2.8 million instruction-response pairs. This approach enables Otter to understand and respond to natural language instructions related to visual content, including complex reasoning and multi-round conversations.
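To make the in-context instruction tuning concrete, here is a minimal sketch of what an instruction-response pair with in-context demonstrations might look like. The field names and the User/GPT turn template are illustrative assumptions, not the MIMIC-IT dataset's exact schema:

```python
# Hypothetical MIMIC-IT-style instruction-response pair.
# Field names are illustrative assumptions, not the dataset's exact schema.
pair = {
    "instruction": "What is unusual about this image?",
    "answer": "A man is ironing clothes on the roof of a moving taxi.",
    "image_ids": ["img_000001"],
    # Related pairs supplied as in-context demonstrations during tuning.
    "in_context_examples": [
        {"instruction": "Describe the scene.",
         "answer": "A busy city street with yellow taxis."},
    ],
}

def to_training_text(p):
    """Flatten a pair plus its in-context examples into one training string."""
    turns = p["in_context_examples"] + [p]
    return "\n".join(f"User: {t['instruction']} GPT: {t['answer']}" for t in turns)

print(to_training_text(pair))
```

The key idea is that each training sample bundles its own demonstrations, so the model learns to condition on prior examples rather than on a single isolated instruction.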
Quick Start & Requirements
- Install: conda env create -f environment.yml
- Prerequisites: a PyTorch build matching your CUDA version (e.g., Torch 2.0.0 with CUDA 11.7), transformers>=4.28.0, accelerate>=0.18.0; at least 16 GB of GPU memory for local execution.
- Resources: Official Huggingface integration available.
- Docs: MIMIC-IT Dataset README, Run Otter Locally
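Since version mismatches are a common setup pitfall, a small pre-flight check can verify the stated prerequisites before launching anything heavy. This is a generic sketch (only the package names and minimum versions come from the requirements above; assumes plain dotted version strings):

```python
import importlib

def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '4.30.2' >= '4.28.0'."""
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(installed) >= parse(required)

# Minimum versions taken from the prerequisites listed above.
REQUIREMENTS = {"transformers": "4.28.0", "accelerate": "0.18.0"}

def check_environment():
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for pkg, minimum in REQUIREMENTS.items():
        try:
            mod = importlib.import_module(pkg)
            if not meets_minimum(mod.__version__, minimum):
                problems.append(f"{pkg} {mod.__version__} < {minimum}")
        except ImportError:
            problems.append(f"{pkg} not installed")
    return problems
```

Numeric tuple comparison avoids the lexicographic trap where the string "4.9.0" would sort after "4.28.0".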
Highlighted Details
- Introduces OtterHD, a fine-tuned version of Fuyu-8B for high-resolution image interpretation without an explicit vision encoder.
- Supports multiple interleaved image/video inputs, a novel feature for instruction-tuned LMMs.
- Includes MagnifierBench for evaluating the identification of small objects and spatial relationships.
- Provides training scripts for various LMMs (OpenFlamingo, Idefics, Fuyu) and datasets (MIMIC-IT, M3IT, LLAVAR).
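The interleaved image/video input that Flamingo-style models consume can be sketched as a prompt-construction helper. The `<image>` placeholder token and the `User:/GPT:<answer>` template are assumptions modeled on common Flamingo-style formatting, not the repo's exact API:

```python
# Sketch of an interleaved visual/text prompt for a Flamingo-style model.
# The "<image>" token and "User:/GPT:<answer>" template are assumptions,
# not the repository's exact prompt format.
def build_interleaved_prompt(visual_inputs, instruction):
    """Prefix one <image> token per image or video frame, then the instruction turn."""
    image_tokens = "<image>" * len(visual_inputs)
    return f"{image_tokens}User: {instruction} GPT:<answer>"

prompt = build_interleaved_prompt(
    ["frame_0.png", "frame_1.png"],
    "What changes between these two frames?",
)
```

Each `<image>` token marks where a visual embedding is spliced into the text stream, which is what lets a single prompt interleave multiple images or video frames with language.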
Maintenance & Community
- Actively updated with new models (OtterHD) and benchmarks (MagnifierBench).
- Welcomes suggestions and PRs for code improvement.
- Contact available for custom scenario development.
Licensing & Compatibility
- The README does not explicitly state a license for the code; the project appears to be permissively licensed, but specific model weights and datasets may carry different terms.
Limitations & Caveats
- The maintainers note that the code may not be perfectly polished.
- Previous code versions may not be runnable due to major changes in dataset organization.
- Requires careful environment setup to match CUDA and PyTorch versions.