Monkey by Yuliang-Liu

Research paper on multimodal models, image resolution, and text labels

created 1 year ago
1,901 stars

Top 23.4% on sourcepulse

Project Summary

Monkey is a large multimodal model (LMM) framework that improves performance by raising input image resolution and optimizing the text labels used in training. It targets researchers and developers working on advanced vision-language tasks, offering stronger image understanding and detailed captioning.

How It Works

Monkey's core idea is that two factors strongly affect LMM performance: the resolution of the input images and the precision of the text labels paired with them. The framework is built around both. It likely incorporates techniques for processing high-resolution images efficiently, and may include methods for text-image alignment or data augmentation.
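To make the high-resolution idea concrete, here is a minimal sketch of one common way to handle a large image: cut it into fixed-size tiles that a vision encoder can process one at a time. This is an illustration under stated assumptions, not a description of Monkey's actual pipeline. The function name tile_boxes, the 448-pixel tile size, and the 1344x896 example image are all hypothetical choices made for this sketch.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixel coordinates
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile: int = 448) -> List[Box]:
    """Return non-overlapping tile boxes that cover a width x height image.

    Hypothetical helper: assumes both dimensions are already multiples of
    `tile`. A real pipeline would resize or pad the image first.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    if width % tile or height % tile:
        raise ValueError("resize or pad the image to a multiple of the tile size first")
    # Walk the image row by row, left to right.
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


if __name__ == "__main__":
    boxes = tile_boxes(1344, 896)
    print(len(boxes))  # 3 columns x 2 rows = 6 tiles
```

Each box could then be cropped and encoded separately, typically alongside a downscaled copy of the whole image so that global context is not lost. Whether Monkey does exactly this is not stated in the summary above.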

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.9, and install the dependencies with pip install -r requirements.txt. A specific flash_attention version may be required.
  • Prerequisites: Python 3.9, CUDA, and possibly a specific flash_attention build.
  • Demo: An online demo is available at http://vlrlab-monkey.xyz:7681. To run the demo offline, download the model weights and edit demo.py.
  • Docs: Model Weight, Detailed Caption Dataset.
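Before installing, it can help to confirm that the environment matches the prerequisites listed above. The following is a small preflight sketch, not part of the Monkey repository. The function name check_environment is hypothetical, and the module names torch and flash_attn are the usual import names for PyTorch and flash-attention, assumed here to be what requirements.txt pulls in.

```python
import importlib.util
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the interpreter and key packages look as expected.

    Hypothetical helper: only checks that packages can be found. It does not
    import them or verify their versions.
    """
    return {
        # The summary asks for Python 3.9 specifically.
        "python_3_9": sys.version_info[:2] == (3, 9),
        # find_spec returns None when a top-level package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

This only confirms that the packages are discoverable. Once PyTorch is installed, torch.cuda.is_available() is the usual way to confirm that a CUDA device is visible.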

Highlighted Details

  • Ranked fifth in the Multimodal Model category on OpenCompass (as of Jan 2024).
  • Nominated as a CVPR 2024 Highlight paper.
  • Supports multiple model variants including Monkey, TextMonkey, and Mini-Monkey.
  • Offers comprehensive evaluation code for 14 VQA datasets.

Maintenance & Community

The project is actively developed with recent papers accepted to NeurIPS 2024 and ICLR 2025. Key contributors are affiliated with HUST-VLRLab.

Licensing & Compatibility

The project is licensed under Apache 2.0. However, the README explicitly states it is intended for non-commercial use only. Commercial inquiries require direct contact with Prof. Yuliang Liu.

Limitations & Caveats

The primary limitation is the non-commercial use restriction. Benchmarks are provided, but hardware requirements for training or for optimal inference performance are not detailed beyond the need for CUDA and a specific flash_attention build.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 163 stars in the last 90 days
