lens by ContextualAI

Vision-language research paper using LLMs

Created 2 years ago
353 stars

Top 79.0% on SourcePulse

View on GitHub
Project Summary

LENS (Large Language Models Enhanced to See) provides a system for leveraging large language models (LLMs) for computer vision tasks by first generating rich natural language descriptions of images. This approach targets researchers and developers seeking to integrate vision capabilities into LLMs without requiring model fine-tuning, offering competitive performance against state-of-the-art models.

How It Works

LENS processes images through a suite of vision modules that output detailed natural language captions, tags, objects, and attributes. These textual descriptions are then fed into an LLM, enabling it to perform various vision-related tasks. This method avoids the need for fine-tuning LLMs on visual data, simplifying integration and potentially improving performance through the LLM's inherent language understanding capabilities.
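
For illustration, the minimal sketch below shows the describe-then-prompt idea with off-the-shelf Hugging Face components; it is not the llm-lens API, and the model choices (BLIP for captioning, GPT-2 as a stand-in for a larger frozen LLM) are assumptions made only for this example.

# Sketch of the LENS idea: vision modules turn the image into text,
# and a frozen LLM answers from that text alone (no visual fine-tuning).
# Model choices here are illustrative, not the ones used by llm-lens.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="gpt2")  # stand-in for a larger frozen LLM

image = Image.open("photo.jpg").convert("RGB")   # any local test image
caption = captioner(image)[0]["generated_text"]

prompt = (
    f"Image caption: {caption}\n"
    "Question: What is the image about?\n"
    "Answer:"
)
print(llm(prompt, max_new_tokens=30)[0]["generated_text"])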

Quick Start & Requirements

  • Install via pip: pip install llm-lens (a short usage sketch follows this list)
  • Recommended: Machine with GPUs and CUDA. CPU-only is functional but slower for large datasets.
  • Python 3.9 environment.
  • Official Demo: [Demo]
  • Official Colab: [Colab]
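
The snippet below sketches what using the package might look like after installation; the names Lens and LensProcessor are assumptions drawn from the project's published examples, so consult the official demo or Colab for the exact API.

# Assumed llm-lens usage; Lens and LensProcessor are taken from the
# project's examples and may differ -- check the official demo/Colab.
import torch
from PIL import Image
from lens import Lens, LensProcessor

raw_image = Image.open("photo.jpg").convert("RGB")  # any local test image
question = "What is the image about?"

lens = Lens()                 # runs the vision modules (tags, attributes, captions)
processor = LensProcessor()   # packs images and questions into model inputs
with torch.no_grad():
    samples = processor([raw_image], [question])
    output = lens(samples)    # textual descriptions ready to hand to an LLM
print(output)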

Highlighted Details

  • Generates natural language descriptions for images to be used as input for LLMs.
  • Achieves competitive performance against models like Flamingo, CLIP, and Kosmos without LLM fine-tuning.
  • Supports augmenting Hugging Face datasets with visual descriptions.
  • Future additions include evaluation scripts and vocabulary generation for paper reproducibility.

Maintenance & Community

  • Project is associated with the paper "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language" (arXiv:2306.16410).
  • Links to official blog and paper are provided.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

The repository is marked as "Coming Soon" for several key features, including evaluation on standard datasets and reproduction scripts for the paper's methodology, indicating it may be in an early development stage.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago