Otter by EvolvingLMMs-Lab

Multimodal model for improved instruction following and in-context learning

created 2 years ago
3,264 stars

Top 15.2% on sourcepulse

View on GitHub
Project Summary

Otter is a multi-modal large language model (LMM) designed for instruction following and in-context learning with images and videos. It is based on the OpenFlamingo architecture and trained on the MIMIC-IT dataset, offering an open-source alternative for researchers and developers working with vision-language tasks.

How It Works

Otter leverages the Flamingo architecture, which excels at processing multiple interleaved image and text inputs. It is trained using an in-context instruction tuning methodology on the MIMIC-IT dataset, which comprises 2.8 million instruction-response pairs. This approach enables Otter to understand and respond to natural language instructions related to visual content, including complex reasoning and multi-round conversations.
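
To make the in-context instruction-tuning idea concrete, the sketch below builds an interleaved prompt in which a worked (image, instruction, answer) demonstration precedes a new query image. This is a minimal illustration only: the <image>, User:, GPT:, and <answer> markers are assumed placeholders, not Otter's actual prompt template or preprocessing code.

  # Minimal sketch of an interleaved, in-context multimodal prompt.
  # The <image>, "User:", "GPT:", and <answer> markers are illustrative
  # placeholders, not Otter's actual prompt template.
  from PIL import Image

  def build_in_context_prompt(demonstrations, query):
      """Interleave (image, instruction, answer) demonstrations before the query turn."""
      images, parts = [], []
      for image, instruction, answer in demonstrations:
          images.append(image)
          parts.append(f"<image> User: {instruction} GPT: <answer> {answer}")
      query_image, query_instruction = query
      images.append(query_image)
      parts.append(f"<image> User: {query_instruction} GPT: <answer>")  # the model would complete this turn
      return images, " ".join(parts)

  # Dummy solid-color images stand in for real photos or video frames.
  demo = (Image.new("RGB", (224, 224), "red"), "What color dominates this image?", "Red.")
  query = (Image.new("RGB", (224, 224), "blue"), "What color dominates this image?")

  images, prompt = build_in_context_prompt([demo], query)
  print(f"{len(images)} interleaved images")
  print(prompt)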

Quick Start & Requirements

  • Install: conda env create -f environment.yml
  • Prerequisites: a PyTorch build matching your CUDA version (e.g., Torch 2.0.0 with CUDA 11.7), transformers>=4.28.0, and accelerate>=0.18.0. At least 16 GB of GPU memory is required for local execution (see the pre-flight check after this list).
  • Resources: Official Hugging Face integration available.
  • Docs: MIMIC-IT Dataset README, Run Otter Locally
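
Because the CUDA/PyTorch pairing and the 16 GB memory floor are the most common setup pitfalls, here is a small pre-flight check. It is a hypothetical helper, not a script shipped with the repository; it only reports the versions and GPU memory that the requirements above refer to.

  # Hypothetical pre-flight check for the requirements listed above;
  # not part of the Otter repository.
  from importlib.metadata import version

  import torch

  print(f"torch {torch.__version__} (built against CUDA {torch.version.cuda})")
  print(f"transformers {version('transformers')} (need >= 4.28.0)")
  print(f"accelerate {version('accelerate')} (need >= 0.18.0)")

  if torch.cuda.is_available():
      total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
      print(f"GPU: {torch.cuda.get_device_name(0)}, {total_gib:.1f} GiB")
      if total_gib < 16:
          print("Warning: the README calls for at least 16 GB of GPU memory.")
  else:
      print("Warning: no CUDA device detected; local GPU inference will not work.")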

Highlighted Details

  • Introduces OtterHD, a fine-tuned version of Fuyu-8B for high-resolution image interpretation without an explicit vision encoder (see the loading sketch after this list).
  • Supports multiple interleaved image/video inputs, a novel feature for instruction-tuned LMMs.
  • Includes MagnifierBench for evaluating the identification of small objects and spatial relationships.
  • Provides training scripts for various LMMs (OpenFlamingo, Idefics, Fuyu) and datasets (MIMIC-IT, M3IT, LLAVAR).
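
Since OtterHD is a fine-tune of Fuyu-8B, a checkpoint in that family can be driven through the standard Fuyu classes in Hugging Face transformers. The sketch below uses the public adept/fuyu-8b base model purely for illustration; substituting an OtterHD checkpoint id is an assumption rather than a documented step, and the Fuyu classes require a newer transformers release than the 4.28 minimum listed earlier.

  # Sketch of running a Fuyu-family model (the architecture OtterHD fine-tunes)
  # through Hugging Face transformers. The base adept/fuyu-8b checkpoint is shown;
  # pointing model_id at an OtterHD checkpoint is an assumption, not a documented step.
  import torch
  from PIL import Image
  from transformers import FuyuForCausalLM, FuyuProcessor

  model_id = "adept/fuyu-8b"
  processor = FuyuProcessor.from_pretrained(model_id)
  model = FuyuForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

  # Fuyu consumes image patches directly, so no separate vision encoder is involved.
  image = Image.new("RGB", (512, 512), "white")  # stand-in for a real high-resolution image
  prompt = "Describe this image in one sentence.\n"

  inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
  output_ids = model.generate(**inputs, max_new_tokens=32)
  answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
  print(answer)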

Maintenance & Community

  • Actively updated with new models (OtterHD) and benchmarks (MagnifierBench).
  • Welcomes suggestions and PRs for code improvement.
  • Contact available for custom scenario development.

Licensing & Compatibility

  • The project itself appears to be under a permissive license, but specific model weights and datasets may have different terms. The README does not explicitly state the license for the code.

Limitations & Caveats

  • The maintainers note that the code is not fully polished.
  • Previous code versions may not be runnable due to major changes in dataset organization.
  • Requires careful environment setup to match CUDA and PyTorch versions.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 21 stars in the last 90 days
