agent-fetch by teng-lin

Full-content web fetcher for AI agents and content workflows

Created 5 months ago

297 stars

Top 89.1% on SourcePulse

Project Summary

agent-fetch retrieves complete, clean web content for AI agents and content workflows. Standard HTTP tools often receive truncated responses because servers fingerprint the client. agent-fetch runs locally, needs no API key, and combines browser impersonation with multiple extraction strategies to return full article text with structure and links preserved. It is aimed at AI agents, RAG pipelines, and LLM applications that need the full page content rather than a summary.

How It Works

The tool impersonates a browser by presenting a configurable TLS fingerprint, which helps it avoid server-side detection. Once connected, it runs several extraction strategies in parallel: Readability, text density, JSON-LD, framework-specific extractors (Next.js, RSC, WP API), and CSS selectors. Running them together lets it cover sites built on different architectures. It keeps the most complete result and composes metadata from whichever source supplies the best values, which makes it a local alternative to simpler fetchers or cloud APIs.
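
The "run every strategy, keep the most complete result" idea can be sketched as follows. This is an illustration only, not agent-fetch's implementation: the two extractors are simplified stand-ins, the function names are hypothetical, and "most complete" is assumed here to mean "longest extracted text".

```python
from __future__ import annotations

import json
import re
from concurrent.futures import ThreadPoolExecutor
from html import unescape
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_json_ld(html: str) -> Optional[str]:
    """Return articleBody from the first JSON-LD object that carries one."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for raw in re.findall(pattern, html, flags=re.S | re.I):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it and try the next one
        if isinstance(data, dict) and isinstance(data.get("articleBody"), str):
            return data["articleBody"]
    return None


def extract_paragraphs(html: str) -> Optional[str]:
    """Crude stand-in for a Readability-style pass: join the text of all <p> tags."""
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    cleaned = [unescape(re.sub(r"<[^>]+>", "", p)).strip() for p in paragraphs]
    text = "\n\n".join(c for c in cleaned if c)
    return text or None


def best_extraction(html: str, strategies: dict[str, Extractor]) -> tuple[str, str]:
    """Run all strategies concurrently; return (name, text) of the longest result."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, html) for name, fn in strategies.items()}
        results = {name: future.result() for name, future in futures.items()}
    candidates = {name: text for name, text in results.items() if text}
    if not candidates:
        raise ValueError("no strategy produced content")
    winner = max(candidates, key=lambda name: len(candidates[name]))
    return winner, candidates[winner]
```

On a page whose visible markup holds only a teaser while the full text sits in structured data, the JSON-LD stand-in would win under the length heuristic. A real selector would likely weigh more than length, for example structure or link preservation.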

Quick Start & Requirements

Install with npm install @teng-lin/agent-fetch, or run it directly with npx agent-fetch <url>. The tool runs locally and requires Node.js and npm. It needs no GPU, special hardware, or API keys. AI agent integration is available via npx skills add teng-lin/agent-fetch.
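
For workflows that are not written in Node.js, one option is to shell out to the CLI. The sketch below wraps only the documented npx agent-fetch <url> form from Python. The helper names are hypothetical, the http(s) check is this sketch's own guard, and the shape of the CLI's output is not assumed beyond "text on stdout".

```python
from __future__ import annotations

import shutil
import subprocess


def build_command(url: str) -> list[str]:
    """Build the argv for the documented `npx agent-fetch <url>` invocation."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an http(s) URL, got {url!r}")
    return ["npx", "agent-fetch", url]


def fetch(url: str, timeout: float = 60.0) -> str:
    """Run the CLI and return whatever it writes to stdout."""
    if shutil.which("npx") is None:
        raise RuntimeError("npx not found on PATH; install Node.js first")
    result = subprocess.run(
        build_command(url),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with URLs that contain characters such as & or ?.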

Highlighted Details

Advanced TLS fingerprinting presets (e.g., chrome-143, ios-safari-18) for browser impersonation.
Robust multi-strategy extraction: Readability, JSON-LD, Next.js/RSC, WP API, text density, custom CSS selectors.
Multi-page crawling with depth control, URL filtering (--include, --exclude), concurrency limits, and rate limiting (--delay).
Support for authenticated sessions via inline or Netscape cookie files (--cookie, --cookie-file).
Flexible output formats: structured Markdown, JSON, plain text, or raw HTML.
Content extraction from local PDF files.
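
To make the crawling options above concrete, here is a minimal sketch of a depth-limited, filtered, rate-limited crawl. It is not the tool's code. It assumes that include and exclude patterns behave like shell-style globs, which the summary does not specify, and it takes a hypothetical get_links callback so the example runs without network access.

```python
from __future__ import annotations

import time
from collections import deque
from fnmatch import fnmatchcase
from typing import Callable, Iterable, Sequence


def allowed(url: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """Pass if the URL matches some include glob (or none are given) and no exclude glob."""
    if include and not any(fnmatchcase(url, pattern) for pattern in include):
        return False
    return not any(fnmatchcase(url, pattern) for pattern in exclude)


def crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 1,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    delay: float = 0.0,
) -> list[str]:
    """Breadth-first crawl from `start`; returns URLs in the order they were visited."""
    seen = {start}
    queue = deque([(start, 0)])
    visited: list[str] = []
    while queue:
        url, depth = queue.popleft()
        if visited and delay > 0:
            time.sleep(delay)  # pause between page visits, not before the first
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen and allowed(link, include, exclude):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In this sketch the start URL is always visited, and the filters apply only to discovered links. Concurrency limits are left out to keep the example short.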

Maintenance & Community

The provided README lacks specific details on maintainers, community channels (Discord/Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

Released under the permissive MIT license, which allows use in commercial and closed-source applications provided the copyright and license notice are retained.

Limitations & Caveats

Users must comply with each website's Terms of Service and robots.txt. The tool neither grants permissions nor bypasses access controls, and legal responsibility for copyright and data protection rests with the user. Extraction may fail on sites whose anti-scraping measures go beyond TLS fingerprinting.
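
One way to honor robots.txt before fetching is a pre-flight check with Python's standard urllib.robotparser module. The sketch below is independent of agent-fetch, and the summary does not say whether the tool performs such a check itself. The user-agent string and helper name are placeholders.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt: str, user_agent: str):
    """Return a function that reports whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly; against a live site one would
    # instead call parser.set_url(".../robots.txt") followed by parser.read().
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


ROBOTS = "User-agent: *\nDisallow: /private/\n"
may_fetch = make_checker(ROBOTS, "example-bot")  # placeholder user agent
```

With these rules, may_fetch("https://example.com/articles/1") is allowed, while a URL under /private/ is not. A robots.txt check does not cover a site's Terms of Service, which still need to be read separately.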

Health Check

Last Commit

4 months ago

Responsiveness

Inactive

Star History

8 stars in the last 30 days