Otter by EvolvingLMMs-Lab

Multimodal model for improved instruction following and in-context learning

Created 2 years ago
3,270 stars

Top 14.8% on SourcePulse

View on GitHub
Project Summary

Otter is a multimodal large language model (LMM) designed for instruction following and in-context learning with images and videos. It is based on the OpenFlamingo architecture and trained on the MIMIC-IT dataset, offering an open-source alternative for researchers and developers working with vision-language tasks.

How It Works

Otter leverages the Flamingo architecture, which excels at processing multiple interleaved image and text inputs. It is trained using an in-context instruction tuning methodology on the MIMIC-IT dataset, which comprises 2.8 million instruction-response pairs. This approach enables Otter to understand and respond to natural language instructions related to visual content, including complex reasoning and multi-round conversations.
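
The sketch below illustrates how such interleaved, in-context inputs can be assembled: demonstration image/instruction/response triples are placed ahead of the final query. The <image>, <answer>, and <|endofchunk|> markers and the User:/GPT: template are assumptions modeled on Flamingo-style prompting, not necessarily Otter's exact format; check the project README for the template it expects.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

from PIL import Image


@dataclass
class Example:
    """One image/instruction/response triple (response is None for the query)."""
    image: Image.Image
    instruction: str
    response: Optional[str] = None


def build_in_context_prompt(
    demos: List[Example], query: Example
) -> Tuple[List[Image.Image], str]:
    """Interleave demonstration image/text pairs ahead of the final query."""
    images, parts = [], []
    for ex in demos:
        images.append(ex.image)
        # "<image>", "<answer>", and "<|endofchunk|>" follow Flamingo-style
        # prompting; the exact template Otter expects is an assumption here.
        parts.append(
            f"<image>User: {ex.instruction} GPT:<answer> {ex.response}<|endofchunk|>"
        )
    images.append(query.image)
    parts.append(f"<image>User: {query.instruction} GPT:<answer>")
    return images, "".join(parts)


if __name__ == "__main__":
    blank = Image.new("RGB", (224, 224))
    demos = [Example(blank, "What color dominates this image?", "Black.")]
    query = Example(blank, "Describe this image in one sentence.")
    images, prompt = build_in_context_prompt(demos, query)
    print(prompt)
```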

Quick Start & Requirements

  • Install: conda env create -f environment.yml
  • Prerequisites: a PyTorch build matching your CUDA version (e.g., Torch 2.0.0 with CUDA 11.7), transformers>=4.28.0, accelerate>=0.18.0. Local execution requires at least 16 GB of GPU memory.
  • Resources: Official Hugging Face integration is available (see the inference sketch after this list).
  • Docs: MIMIC-IT Dataset README, Run Otter Locally
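
For local inference via the Hugging Face integration, a minimal sketch follows. The otter_ai import path, the OtterForConditionalGeneration class, the luodian/OTTER-Image-MPT7B checkpoint ID, and the generate() keyword arguments are assumptions drawn from the project's demo material; verify them against the "Run Otter Locally" docs before relying on this.

```python
# Minimal local-inference sketch. Package, class, checkpoint ID, and
# generate() arguments are assumptions; verify against "Run Otter Locally".
import requests
import torch
import transformers
from PIL import Image

from otter_ai import OtterForConditionalGeneration  # assumed import path

model = OtterForConditionalGeneration.from_pretrained(
    "luodian/OTTER-Image-MPT7B",   # assumed checkpoint ID
    device_map="auto",
    torch_dtype=torch.bfloat16,    # bf16 helps fit the ~16 GB GPU requirement
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()

# Any RGB image works; this COCO URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# vision_x is shaped (batch, num_images, num_frames, C, H, W).
vision_x = (
    image_processor.preprocess([image], return_tensors="pt")["pixel_values"]
    .unsqueeze(1)
    .unsqueeze(0)
)
lang_x = tokenizer(["<image>User: What is in this picture? GPT:<answer>"],
                   return_tensors="pt")

generated = model.generate(
    vision_x=vision_x.to(model.device, dtype=model.dtype),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=128,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```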

Highlighted Details

  • Introduces OtterHD, a fine-tuned version of Fuyu-8B for high-resolution image interpretation without an explicit vision encoder.
  • Supports multiple interleaved image/video inputs, a novel feature for instruction-tuned LMMs.
  • Includes MagnifierBench for evaluating the identification of small objects and spatial relationships.
  • Provides training scripts for various LMMs (OpenFlamingo, Idefics, Fuyu) and datasets (MIMIC-IT, M3IT, LLAVAR).

Maintenance & Community

  • Actively updated with new models (OtterHD) and benchmarks (MagnifierBench).
  • Welcomes suggestions and PRs for code improvement.
  • Contact available for custom scenario development.

Licensing & Compatibility

  • The project appears to use a permissive license, but the README does not explicitly state one for the code; individual model weights and datasets may be distributed under separate terms.

Limitations & Caveats

  • The maintainers note that the code may not be fully polished.
  • Previous code versions may not be runnable due to major changes in dataset organization.
  • Requires careful environment setup to match CUDA and PyTorch versions.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

  • Open-source framework for training large multimodal models
  • Top 0.1% on SourcePulse, 4k stars
  • Created 2 years ago, updated 1 year ago