Photo-agents by jmerelnyc

Autonomous, vision-grounded LLM agents for computer operation

Created 2 weeks ago

1,007 stars

Top 36.6% on SourcePulse

Project Summary

Summary

Photo Agents provides a runtime for autonomous, self-evolving LLM agents that operate computer systems by grounding their reasoning in visual screen content. Aimed at developers and power users, it lets agents perceive, reason, and act directly on the UI. It runs locally for data privacy, and agents write their own skills to adapt to new tasks.

How It Works

The system runs a streaming agent loop built around a perceive → reason → act cycle. Memory is vision-grounded: observations are stored in biologically inspired layers rather than as text transcripts. The agent generates its own skills from tasks it has completed successfully, which is how it evolves over time, and it relies on visual context to interact with the UI.
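The perceive → reason → act cycle can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the photoagents codebase: the `Observation` and `Action` types, the `agent_loop` generator, and the toy counter environment are assumptions chosen only to show the shape of a streaming loop, with a string standing in for a screenshot.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Observation:
    step: int
    screen: str  # stand-in for a screenshot; a real agent would hold image data


@dataclass
class Action:
    name: str
    argument: str = ""


def agent_loop(
    perceive: Callable[[int], Observation],
    reason: Callable[[Observation], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> Iterator[tuple[Observation, Action]]:
    """Yield (observation, action) pairs as they happen, so callers can stream them."""
    for step in range(max_steps):
        obs = perceive(step)      # perceive: capture the current screen
        action = reason(obs)      # reason: choose the next action from what is visible
        yield obs, action
        if action.name == "done":
            return
        act(action)               # act: apply the action to the environment


def run_demo() -> list[str]:
    """Toy environment: a counter shown on 'screen' that the agent clicks up to 3."""
    state = {"count": 0}

    def perceive(step: int) -> Observation:
        return Observation(step, f"count={state['count']}")

    def reason(obs: Observation) -> Action:
        count = int(obs.screen.split("=")[1])
        return Action("done") if count >= 3 else Action("click", "increment")

    def act(action: Action) -> None:
        if action.name == "click":
            state["count"] += 1

    return [action.name for _, action in agent_loop(perceive, reason, act)]


if __name__ == "__main__":
    print(run_demo())
```

Because `agent_loop` is a generator, a client can render each step as it is produced instead of waiting for the whole task to finish.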

Quick Start & Requirements

  • Install: pip install photoagents or pip install "photoagents[all]".
  • Requirements: Python 3.10+.
  • API key: a Photo Agents API key, validated via https://photo-agents.com, is mandatory.
  • LLM credentials: provider credentials (OpenAI, Anthropic) must be configured (e.g., credentials.py).
  • Run: python -m photoagents for interactive use, or launch a GUI client such as Streamlit (pythonw -m photoagents.cli.launcher) or PyQt (python -m photoagents.clients.desktop_app).
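Before a first run, the prerequisites above can be checked with a small preflight script. This is a sketch under stated assumptions, not part of the project: `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the variable names the providers' own SDKs read by default, but whether photoagents reads them (rather than credentials.py) is an assumption, and `PHOTO_AGENTS_API_KEY` is a hypothetical name used only for illustration.

```python
import importlib.util
import os
import sys

# Assumption: provider keys are supplied through these environment variables.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
# Hypothetical variable name for the Photo Agents API key.
PLATFORM_KEY = "PHOTO_AGENTS_API_KEY"


def preflight(env, min_version=(3, 10), package="photoagents"):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(map(str, min_version))
        problems.append(f"Python {wanted}+ is required")
    if importlib.util.find_spec(package) is None:
        problems.append(f"package {package!r} is not installed")
    if not env.get(PLATFORM_KEY):
        problems.append(f"{PLATFORM_KEY} is not set")
    if not any(env.get(name) for name in PROVIDER_KEYS):
        problems.append("no LLM provider credential is set")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print("missing:", problem)
```

The check only confirms that values are present; it does not validate the key against the remote service.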

Highlighted Details

  • Multi-provider LLM router supporting Anthropic Claude and OpenAI GPT.
  • Physical-execution toolset: file I/O, sandboxed code execution (Python, bash, PowerShell), browser automation (CDP).
  • Layered memory system (working, global, SOP, session archive).
  • Pluggable clients for Streamlit, PyQt, Telegram, Feishu, WeCom, DingTalk.
  • Optional Langfuse observability and cron-style scheduler.
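One way to picture the multi-provider router is a dispatcher that picks a backend from the model name. The sketch below is an assumption about the general technique, not the project's router: the `Router` class, its `register` and `complete` methods, and the stub providers are all hypothetical, and the stubs return labeled strings instead of calling any real API.

```python
from typing import Callable

# A provider takes (model, prompt) and returns completion text.
Provider = Callable[[str, str], str]


class Router:
    """Dispatch a request to the first provider whose prefix matches the model name."""

    def __init__(self) -> None:
        self._routes: list[tuple[str, Provider]] = []

    def register(self, prefix: str, provider: Provider) -> None:
        self._routes.append((prefix, provider))

    def complete(self, model: str, prompt: str) -> str:
        for prefix, provider in self._routes:
            if model.startswith(prefix):
                return provider(model, prompt)
        raise ValueError(f"no provider registered for model {model!r}")


def fake_anthropic(model: str, prompt: str) -> str:
    # Stub: a real provider would call the Anthropic API here.
    return f"[anthropic:{model}] {prompt}"


def fake_openai(model: str, prompt: str) -> str:
    # Stub: a real provider would call the OpenAI API here.
    return f"[openai:{model}] {prompt}"


def build_router() -> Router:
    router = Router()
    router.register("claude", fake_anthropic)
    router.register("gpt", fake_openai)
    return router


if __name__ == "__main__":
    print(build_router().complete("gpt-example", "describe the screen"))
```

Keeping providers behind a single callable signature means the agent loop never needs to know which vendor answered a given request.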

Maintenance & Community

Project website: https://photo-agents.com. The project posts updates and demos on X/Twitter (https://x.com/photoagents). The README does not name core maintainers, sponsors, or dedicated community channels (Discord/Slack).

Licensing & Compatibility

Released under the MIT license, permitting broad usage, including commercial applications and linking within closed-source projects.

Limitations & Caveats

The software is in beta, with APIs subject to change before 1.0. A remote-validated API key is a prerequisite for runtime operation, serving as an accountability gate.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

1,042 stars in the last 19 days
