lens by ContextualAI

Vision-language research paper using LLMs

created 2 years ago
352 stars

Top 80.3% on sourcepulse

Project Summary

LENS (Large Language Models Enhanced to See) lets large language models (LLMs) handle computer vision tasks by first converting each image into rich natural language descriptions. It targets researchers and developers who want vision capabilities from an LLM without fine-tuning the model, and it reports performance competitive with state-of-the-art models.

How It Works

LENS passes each image through a suite of vision modules that output detailed natural language captions, tags, objects, and attributes. These textual descriptions are then fed into an LLM, which uses them to perform vision-related tasks. Because the LLM only ever sees text, it does not need to be fine-tuned on visual data, which simplifies integration and may improve performance by relying on the LLM's existing language understanding.
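
The hand-off from vision modules to LLM described above can be illustrated with a minimal sketch. It is not the `llm-lens` API: the `VisualDescription` container, the `build_prompt` function, and the exact prompt layout are assumptions made for illustration, and the tags, attributes, and captions are hard-coded where real vision modules would supply them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualDescription:
    """Text produced by the vision modules for one image (hypothetical container)."""
    tags: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def build_prompt(desc: VisualDescription, question: str) -> str:
    """Flatten the vision-module outputs into one text prompt for a frozen LLM."""
    lines = [
        "Tags: " + ", ".join(desc.tags),
        "Attributes: " + ", ".join(desc.attributes),
        "Captions: " + " ".join(desc.captions),
        "Question: " + question,
        "Short Answer:",  # the LLM is expected to continue from here
    ]
    return "\n".join(lines)


# Hard-coded stand-ins for what the vision modules would return.
desc = VisualDescription(
    tags=["dog", "frisbee"],
    attributes=["brown fur", "green grass"],
    captions=["A dog jumps to catch a frisbee."],
)
prompt = build_prompt(desc, "What is the dog catching?")
print(prompt)
```

The resulting string is all the LLM receives, so any text-only model could be substituted without retraining.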

Quick Start & Requirements

  • Install via pip: pip install llm-lens
  • Recommended: Machine with GPUs and CUDA. CPU-only is functional but slower for large datasets.
  • Python 3.9 environment.
  • Official Demo: [Demo]
  • Official Colab: [Colab]

Highlighted Details

  • Generates natural language descriptions for images to be used as input for LLMs.
  • Achieves competitive performance against models like Flamingo, CLIP, and Kosmos without LLM fine-tuning.
  • Supports augmenting Hugging Face datasets with visual descriptions.
  • Future additions include evaluation scripts and vocabulary generation for paper reproducibility.
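
As a rough illustration of what augmenting a dataset with visual descriptions could look like, the sketch below adds description fields to plain dictionary records. It deliberately uses neither the Hugging Face `datasets` library nor the `llm-lens` API; `augment_records` and `fake_describe` are hypothetical names, and `fake_describe` returns fixed placeholder text where real vision modules would run.

```python
from typing import Callable, Dict, Iterable, List


def augment_records(
    records: Iterable[Dict[str, object]],
    describe: Callable[[object], Dict[str, List[str]]],
    image_key: str = "image",
) -> List[Dict[str, object]]:
    """Return copies of the records with description fields added next to the image."""
    augmented = []
    for record in records:
        new_record = dict(record)  # shallow copy so the input is not mutated
        new_record.update(describe(record[image_key]))
        augmented.append(new_record)
    return augmented


def fake_describe(image: object) -> Dict[str, List[str]]:
    """Stand-in for the vision modules; returns fixed text for illustration."""
    return {
        "tags": ["placeholder tag"],
        "captions": [f"A placeholder caption for {image}."],
    }


rows = [{"image": "img_0.png", "label": 3}]
augmented_rows = augment_records(rows, fake_describe)
print(augmented_rows[0])
```

Keeping the description step behind a single callable means the same loop works whether the descriptions come from a GPU pipeline or from a cached file.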

Maintenance & Community

  • Project is associated with the paper "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language" (arXiv:2306.16410).
  • Links to official blog and paper are provided.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

The repository lists several key features as "Coming Soon", including evaluation on standard datasets and scripts for reproducing the paper's methodology, which suggests it may be at an early stage of development.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
