Vision utilities SDK for web interaction agents
Top 25.4% on sourcepulse
Tarsier provides vision utilities for web interaction agents, addressing the challenge of feeding webpage information to LLMs and mapping LLM responses back to web elements. It targets developers building autonomous agents that navigate and interact with websites, offering a system to represent visual page structure for LLMs.
How It Works
Tarsier visually tags interactable elements (buttons, links, inputs) with bracketed IDs (e.g., [23]) to create a mapping between LLM actions and web elements. It also includes an OCR algorithm that converts screenshots into a structured, whitespace-preserving text representation, akin to ASCII art. This approach aims to give LLMs, even text-only ones, a detailed understanding of a webpage's visual layout and interactive components, reportedly outperforming vision-language models on specific benchmarks.
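To make the whitespace-preserving idea concrete, here is a minimal sketch (not Tarsier's actual implementation) of placing OCR word boxes onto a character grid so that relative positions survive in plain text; the `(text, col, row)` tuple format is a simplification invented for illustration:

```python
def render_grid(words, width=40, height=5):
    """Place (text, col, row) word tuples on a character grid.

    Returns an ASCII-art string in which horizontal and vertical
    spacing mirrors the words' on-screen positions.
    """
    grid = [[" "] * width for _ in range(height)]
    for text, col, row in words:
        for i, ch in enumerate(text):
            if col + i < width:
                grid[row][col + i] = ch
    # Strip trailing spaces so the output stays compact.
    return "\n".join("".join(row).rstrip() for row in grid)

# Hypothetical OCR output for a tagged page: nav links on one line,
# body text and a tagged input field lower down.
layout = render_grid([
    ("[1] Home", 0, 0),
    ("[2] Login", 30, 0),
    ("Welcome back!", 10, 2),
    ("[#3] email", 10, 4),
])
print(layout)
```

A text-only LLM reading `layout` can infer, for instance, that the login link sits in the top-right corner, purely from the whitespace.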
Quick Start & Requirements
pip install tarsier
Highlighted Details
Tag prefixes distinguish element types: [#ID] for text inputs, [@ID] for links, and [$ID] for other interactables.
Maintenance & Community
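The prefix scheme above lends itself to simple parsing when mapping an LLM's reply back to an action. The sketch below is an illustrative interpretation, not Tarsier's API: it assumes plain [ID] tags (as in the [23] example earlier) denote generic interactables, and that a separate tag-to-XPath mapping handles element lookup.

```python
import re

# Prefix meanings per the tag scheme described above.
TAG_KINDS = {"#": "text_input", "@": "link", "$": "interactable"}

def parse_tag(tag):
    """Parse a Tarsier-style tag string into (kind, id).

    '[#12]' -> ('text_input', 12); '[@3]' -> ('link', 3);
    a bare '[23]' is treated as a generic interactable.
    """
    m = re.fullmatch(r"\[([#@$]?)(\d+)\]", tag)
    if not m:
        raise ValueError(f"not a Tarsier-style tag: {tag!r}")
    prefix, num = m.groups()
    return TAG_KINDS.get(prefix, "interactable"), int(num)
```

An agent could then route the parsed kind to the right browser action (type into a `text_input`, click a `link`, and so on) before resolving the ID to a concrete element.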
Licensing & Compatibility
The README does not specify the project's license, a critical factor for commercial adoption or integration into closed-source projects.
Limitations & Caveats
Support for Microsoft Azure Computer Vision is listed as "Coming Soon" even though example code and setup instructions for it are already present.