Vision utilities SDK for web interaction agents
Top 25.4% on sourcepulse
Tarsier provides vision utilities for web interaction agents, addressing the challenge of feeding webpage information to LLMs and mapping LLM responses back to web elements. It targets developers building autonomous agents that navigate and interact with websites, offering a system to represent visual page structure for LLMs.
How It Works
Tarsier visually tags interactable elements (buttons, links, inputs) with bracketed IDs (e.g., [23]) to create a mapping between LLM actions and web elements. It also includes an OCR algorithm that converts screenshots into a structured, whitespace-preserving text representation, akin to ASCII art. This approach aims to give LLMs, even text-only ones, a detailed understanding of a webpage's visual layout and interactive components, reportedly outperforming vision-language models on specific benchmarks.
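To make the whitespace-preserving idea concrete, here is a minimal sketch (not Tarsier's actual implementation) of placing OCR word boxes onto a character grid so that relative positions survive in plain text; the `(text, col, row)` tuple format is a simplification invented for illustration:

```python
def render_grid(words, width=40, height=5):
    """Place (text, col, row) word tuples on a character grid.

    Returns an ASCII-art string in which horizontal and vertical
    spacing mirrors the words' on-screen positions.
    """
    grid = [[" "] * width for _ in range(height)]
    for text, col, row in words:
        for i, ch in enumerate(text):
            if col + i < width:
                grid[row][col + i] = ch
    # Strip trailing spaces so the output stays compact.
    return "\n".join("".join(row).rstrip() for row in grid)

# Hypothetical OCR output for a tagged page: nav links on one line,
# body text and a tagged input field lower down.
layout = render_grid([
    ("[1] Home", 0, 0),
    ("[2] Login", 30, 0),
    ("Welcome back!", 10, 2),
    ("[#3] email", 10, 4),
])
print(layout)
```

A text-only LLM reading `layout` can infer, for instance, that the login link sits in the top-right corner, purely from the whitespace.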
Quick Start & Requirements
pip install tarsier
Highlighted Details
Tag prefixes distinguish element types: [#ID] for text inputs, [@ID] for links, and [$ID] for other interactables.
Maintenance & Community
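The prefix scheme above lends itself to simple parsing when mapping an LLM's reply back to an action. The sketch below is an illustrative interpretation, not Tarsier's API: it assumes plain [ID] tags (as in the [23] example earlier) denote generic interactables, and that a separate tag-to-XPath mapping handles element lookup.

```python
import re

# Prefix meanings per the tag scheme described above.
TAG_KINDS = {"#": "text_input", "@": "link", "$": "interactable"}

def parse_tag(tag):
    """Parse a Tarsier-style tag string into (kind, id).

    '[#12]' -> ('text_input', 12); '[@3]' -> ('link', 3);
    a bare '[23]' is treated as a generic interactable.
    """
    m = re.fullmatch(r"\[([#@$]?)(\d+)\]", tag)
    if not m:
        raise ValueError(f"not a Tarsier-style tag: {tag!r}")
    prefix, num = m.groups()
    return TAG_KINDS.get(prefix, "interactable"), int(num)
```

An agent could then route the parsed kind to the right browser action (type into a `text_input`, click a `link`, and so on) before resolving the ID to a concrete element.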
Licensing & Compatibility
The README does not specify the project's license, a critical factor for commercial adoption or integration into closed-source projects.
Limitations & Caveats
Support for Microsoft Azure Computer Vision is listed as "Coming Soon" even though example code and setup instructions for it are already present.