Monkey by Yuliang-Liu

Research paper on multimodal models, image resolution, and text labels

Created 1 year ago
1,919 stars

Top 22.8% on SourcePulse

Project Summary

Monkey is a multimodal large language model (LMM) framework focused on improving performance through image resolution and text label optimization. It targets researchers and developers working on advanced vision-language tasks, offering enhanced capabilities for image understanding and generation.

How It Works

Monkey's core innovation lies in its approach to handling image data for LMMs. It emphasizes the importance of both high-resolution image inputs and precise text labels, suggesting these are critical factors for model performance. The framework likely incorporates techniques for efficient high-resolution image processing and potentially novel methods for text-image alignment or data augmentation.
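To make the high-resolution idea concrete, here is a minimal sketch of one plausible approach: tiling a large image into fixed-size patches so each patch can be processed by a vision encoder at its native resolution. The patch size and grid policy here are illustrative assumptions, not Monkey's actual implementation.

```python
# Hedged sketch: tile a high-resolution image into fixed-size patches.
# The 448px patch size and simple non-overlapping grid are assumptions
# for illustration, not Monkey's documented pipeline.
def tile_image(width, height, patch=448):
    """Return (left, top, right, bottom) boxes covering the image in a grid."""
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            boxes.append((left, top,
                          min(left + patch, width),
                          min(top + patch, height)))
    return boxes

# A 1344x896 image yields a 3x2 grid of 448px tiles.
boxes = tile_image(1344, 896)
```

Each tile could then be encoded independently and the resulting features concatenated before the language model, which is the general pattern behind resolution-scaling LMMs.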

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.9, and install dependencies via pip install -r requirements.txt. Specific flash_attention versions may be required.
  • Prerequisites: Python 3.9, CUDA, and potentially specific flash_attention builds.
  • Demo: An online demo is available at http://vlrlab-monkey.xyz:7681. Offline demo setup involves downloading model weights and modifying demo.py.
  • Docs: Model Weight, Detailed Caption Dataset.
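The install steps above can be sketched as a shell session; the repository path and environment name are assumptions based on the project's author and name:

```shell
# Assumed repository path; adjust if the repo lives elsewhere.
git clone https://github.com/Yuliang-Liu/Monkey.git
cd Monkey

# Create and activate a Python 3.9 environment (env name is arbitrary).
conda create -n monkey python=3.9 -y
conda activate monkey

# Install dependencies; a specific flash_attention build may also be needed.
pip install -r requirements.txt
```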

Highlighted Details

  • Achieved fifth rank in the Multimodal Model category on OpenCompass (as of Jan 2024).
  • Nominated as a CVPR 2024 Highlight paper.
  • Supports multiple model variants including Monkey, TextMonkey, and Mini-Monkey.
  • Offers comprehensive evaluation code for 14 VQA datasets.

Maintenance & Community

The project is actively developed with recent papers accepted to NeurIPS 2024 and ICLR 2025. Key contributors are affiliated with HUST-VLRLab.

Licensing & Compatibility

The project is licensed under Apache 2.0. However, the README explicitly states it is intended for non-commercial use only. Commercial inquiries require direct contact with Prof. Yuliang Liu.

Limitations & Caveats

The primary limitation is the non-commercial use restriction. While benchmarks are provided, hardware requirements for optimal inference or training are not detailed beyond the need for CUDA and specific flash_attention builds.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 18 stars in the last 30 days
