tarsier by reworkd

Vision utilities SDK for web interaction agents

created 1 year ago
1,712 stars

Top 25.4% on sourcepulse

Project Summary

Tarsier provides vision utilities for web interaction agents. It addresses two problems: feeding webpage information to an LLM, and mapping the LLM's responses back to elements on the page. It targets developers building autonomous agents that navigate and interact with websites, and gives them a way to represent a page's visual structure as LLM input.

How It Works

Tarsier visually tags interactable elements (buttons, links, inputs) with bracketed IDs (e.g., [23]), so that an ID mentioned by the LLM can be mapped back to a concrete element. It also includes an OCR algorithm that converts screenshots into a structured, whitespace-preserving text representation, akin to ASCII art. The aim is to give LLMs, including text-only ones, a detailed picture of a webpage's visual layout and interactive components. The project reports that this text representation outperforms a vision-language model fed screenshots on its internal benchmarks.
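To make the "whitespace-preserving text" idea concrete, here is a minimal sketch of how OCR words with pixel positions could be laid out on a character grid. It is not Tarsier's actual algorithm: the Word class, the layout_text function, and the char_width and line_height defaults are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR result (hypothetical shape): text plus a pixel position."""
    text: str
    x: float  # left edge, in pixels
    y: float  # vertical position, in pixels


def layout_text(words, char_width=8.0, line_height=16.0):
    """Place words on a character grid so relative positions survive as whitespace.

    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    if not words:
        return ""

    # Bucket words into text rows by their vertical position.
    rows = {}
    for word in words:
        rows.setdefault(int(word.y // line_height), []).append(word)

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row, []), key=lambda w: w.x):
            col = int(word.x // char_width)
            # If the target column is already occupied, keep one space of separation.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + word.text
        lines.append(line)  # empty rows become blank lines, preserving vertical gaps
    return "\n".join(lines)


if __name__ == "__main__":
    page = [Word("Login", 0, 4), Word("[#1]", 80, 4), Word("Submit", 16, 36)]
    print(layout_text(page))
```

In this sketch, horizontal gaps become runs of spaces and vertical gaps become blank lines, which is the property that lets a text-only model reason about layout.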

Quick Start & Requirements

  • Install via pip: pip install tarsier
  • Requires cloud OCR service credentials (Google Cloud Vision or Microsoft Azure Computer Vision).
  • Example usage and agent cookbook available: https://github.com/reworkd/tarsier

Highlighted Details

  • Tags elements with specific prefixes: [#ID] for text inputs, [@ID] for links, [$ID] for other interactables.
  • OCR algorithm converts screenshots to a structured text format for LLMs.
  • Claims unimodal GPT-4 + Tarsier-Text outperforms GPT-4V + Tarsier-Screenshot by 10-20% on internal benchmarks.
  • Supports Google Cloud Vision; Amazon Textract and Microsoft Azure Computer Vision are listed as planned (see Limitations & Caveats for the status of Azure).
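The tag prefixes above suggest how an agent might turn an LLM's reply back into page elements. The following is a minimal sketch under stated assumptions: the resolve_tags function, the kind labels, and the shape of the tag-to-XPath mapping (integer ID to XPath string) are hypothetical and are not taken from Tarsier's API.

```python
import re

# Matches tags such as [#4], [@12], [$7], or an unprefixed [23].
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]")

# Prefix meanings as listed above; the label strings are illustrative.
KINDS = {"#": "text_input", "@": "link", "$": "other", "": "untyped"}


def resolve_tags(llm_response, tag_to_xpath):
    """Return (kind, id, xpath) for every tag in the response that has a known XPath.

    tag_to_xpath is assumed to map integer tag IDs to XPath strings.
    Tags with no entry in the mapping are skipped rather than guessed at.
    """
    resolved = []
    for prefix, number in TAG_PATTERN.findall(llm_response):
        tag_id = int(number)
        xpath = tag_to_xpath.get(tag_id)
        if xpath is not None:
            resolved.append((KINDS[prefix], tag_id, xpath))
    return resolved


if __name__ == "__main__":
    mapping = {4: "//input[1]", 12: "//a[3]"}
    reply = "Type the query into [#4], then click [@12]."
    for kind, tag_id, xpath in resolve_tags(reply, mapping):
        print(kind, tag_id, xpath)
```

Skipping unknown IDs is a deliberate choice in this sketch: an LLM can hallucinate a tag that was never on the page, and acting on it would be worse than ignoring it.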

Maintenance & Community

  • The README lists a roadmap covering documentation, examples, and interface cleanup; recent activity is low (see Health Check).
  • Community support via Discord.

Licensing & Compatibility

  • License not explicitly stated in the README. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The README does not state the project's license, so suitability for commercial adoption or integration into closed-source projects is unclear. Support for Microsoft Azure Computer Vision is marked "Coming Soon" even though the README already includes example code and setup instructions for it.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 58 stars in the last 90 days
