Monkey by Yuliang-Liu

Research paper on multimodal models, image resolution, and text labels

Created 1 year ago
1,919 stars

Top 22.8% on SourcePulse

Project Summary

Monkey is a multimodal large language model (LMM) framework focused on improving performance through image resolution and text label optimization. It targets researchers and developers working on advanced vision-language tasks, offering enhanced capabilities for image understanding and generation.

How It Works

Monkey's core innovation lies in its approach to handling image data for LMMs. It emphasizes the importance of both high-resolution image inputs and precise text labels, suggesting these are critical factors for model performance. The framework likely incorporates techniques for efficient high-resolution image processing and potentially novel methods for text-image alignment or data augmentation.
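To make the high-resolution idea concrete, here is a minimal sketch of one plausible approach: tiling a large image into fixed-size patches so each patch can be processed by a vision encoder at its native resolution. The patch size and grid policy here are illustrative assumptions, not Monkey's actual implementation.

```python
# Hedged sketch: tile a high-resolution image into fixed-size patches.
# The 448px patch size and simple non-overlapping grid are assumptions
# for illustration, not Monkey's documented pipeline.
def tile_image(width, height, patch=448):
    """Return (left, top, right, bottom) boxes covering the image in a grid."""
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            boxes.append((left, top,
                          min(left + patch, width),
                          min(top + patch, height)))
    return boxes

# A 1344x896 image yields a 3x2 grid of 448px tiles.
boxes = tile_image(1344, 896)
```

Each tile could then be encoded independently and the resulting features concatenated before the language model, which is the general pattern behind resolution-scaling LMMs.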

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.9, and install dependencies via pip install -r requirements.txt. Specific flash_attention versions may be required.
  • Prerequisites: Python 3.9, CUDA, and potentially specific flash_attention builds.
  • Demo: An online demo is available at http://vlrlab-monkey.xyz:7681. Offline demo setup involves downloading model weights and modifying demo.py.
  • Docs: Model Weight, Detailed Caption Dataset.
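The install steps above can be sketched as a shell session; the repository path and environment name are assumptions based on the project's author and name:

```shell
# Assumed repository path; adjust if the repo lives elsewhere.
git clone https://github.com/Yuliang-Liu/Monkey.git
cd Monkey

# Create and activate a Python 3.9 environment (env name is arbitrary).
conda create -n monkey python=3.9 -y
conda activate monkey

# Install dependencies; a specific flash_attention build may also be needed.
pip install -r requirements.txt
```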

Highlighted Details

  • Achieved fifth rank in the Multimodal Model category on OpenCompass (as of Jan 2024).
  • Nominated as a CVPR 2024 Highlight paper.
  • Supports multiple model variants including Monkey, TextMonkey, and Mini-Monkey.
  • Offers comprehensive evaluation code for 14 VQA datasets.

Maintenance & Community

The project is actively developed with recent papers accepted to NeurIPS 2024 and ICLR 2025. Key contributors are affiliated with HUST-VLRLab.

Licensing & Compatibility

The project is licensed under Apache 2.0. However, the README explicitly states it is intended for non-commercial use only. Commercial inquiries require direct contact with Prof. Yuliang Liu.

Limitations & Caveats

The primary limitation is the non-commercial use restriction. While benchmarks are provided, hardware requirements for optimal inference or training are not detailed beyond the need for CUDA and specific flash_attention builds.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 18 stars in the last 30 days
