reader by jina-ai

LLM input converter via URL

created 1 year ago
9,020 stars

Top 5.7% on sourcepulse

Project Summary

This project provides a free, stable, and scalable API service for converting web content into a format suitable for Large Language Models (LLMs) and for performing web searches. It targets developers building LLM-powered agents and RAG systems, offering enhanced input quality and access to real-time information.

How It Works

The service operates via two main endpoints: r.jina.ai for content retrieval and s.jina.ai for web search. r.jina.ai fetches content from any URL, processing it for LLM consumption, including handling JavaScript-heavy Single Page Applications (SPAs) via Puppeteer and headless Chrome. s.jina.ai performs web searches, retrieves the top 5 results, and then applies the r.jina.ai processing to each, providing richer context than typical search engine API snippets.
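
The search flow described above (search, keep the top 5 results, then run each through the reader step) can be sketched as a small composition. This is only an illustration of the described behavior, not the service's actual implementation: `search_and_read`, `search`, and `read` are hypothetical names, and the two callables stand in for whatever search backend and reader logic are used.

```python
from typing import Callable, Dict, List


def search_and_read(
    query: str,
    search: Callable[[str], List[str]],
    read: Callable[[str], str],
    top_k: int = 5,
) -> List[Dict[str, str]]:
    """Return the processed content of the top_k search results for query.

    `search` maps a query to a ranked list of result URLs (hypothetical).
    `read` maps a URL to LLM-ready text (hypothetical stand-in for r.jina.ai).
    """
    # Keep only the highest-ranked results, mirroring the "top 5" behavior.
    urls = search(query)[:top_k]
    # Apply the reader step to every retained result.
    return [{"url": url, "content": read(url)} for url in urls]
```

Passing the search and read steps in as callables keeps the sketch independent of any particular backend and makes it easy to exercise with stubs.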

Quick Start & Requirements

  • Installation: Clone the repository (git clone git@github.com:jina-ai/reader.git) and run npm install.
  • Prerequisites: Node.js v18 (the build fails on Node.js versions later than 18).
  • Usage: Access via https://r.jina.ai/<your_url> for content reading or https://s.jina.ai/<your_query> for web search.
  • Demo: Live demo available at the project's GitHub repository.
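
As a minimal sketch of the usage pattern above, the two request URLs can be built by prefixing the endpoint to the target URL or query. The helper names (`reader_url`, `search_url`, `fetch_text`) are hypothetical, and percent-encoding the search query is an assumption; the source only shows the `<your_url>` and `<your_query>` placeholders.

```python
from urllib.parse import quote
from urllib.request import urlopen

READER_BASE = "https://r.jina.ai/"
SEARCH_BASE = "https://s.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prefix the reader endpoint to the page URL to be converted."""
    return READER_BASE + target_url


def search_url(query: str) -> str:
    """Prefix the search endpoint to the query (percent-encoded; an assumption)."""
    return SEARCH_BASE + quote(query, safe="")


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Perform a plain GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_text(reader_url("https://example.com"))` would request the converted content of that page, assuming the service is reachable from the caller's network.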

Highlighted Details

  • Supports fetching content from Single Page Applications (SPAs) using headless Chrome.
  • Offers advanced control via request headers for features like image captioning, cookie forwarding, proxying, and content selection.
  • s.jina.ai provides full content of top search results, not just snippets.
  • Adaptive crawler introduced for recursive website crawling and relevant page extraction.
  • PDF and image content extraction capabilities are supported.

Maintenance & Community

The service is actively maintained by Jina AI as a core product. Updates are deployed directly from commits to this repository. Users can report issues with specific URLs.

Licensing & Compatibility

Licensed under Apache-2.0. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The build fails on Node.js versions later than v18. The API is generally stable, though it has been hit by DDoS attacks in the past; reliability has since been improved. Some SPAs need specific request headers (e.g., x-timeout, x-wait-for-selector) for their content to be captured fully.
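
A request that sets the two headers named above might be prepared as follows. This is a sketch under stated assumptions: the header names come from the text, but the value formats (a timeout in seconds, a CSS selector) and the helper name `build_spa_request` are assumptions for illustration.

```python
from typing import Optional
from urllib.request import Request


def build_spa_request(
    target_url: str,
    timeout_seconds: Optional[int] = None,
    wait_for_selector: Optional[str] = None,
) -> Request:
    """Prepare a reader request with optional SPA-related headers."""
    headers = {}
    if timeout_seconds is not None:
        # Assumed to be a number of seconds to allow the page to load.
        headers["x-timeout"] = str(timeout_seconds)
    if wait_for_selector is not None:
        # Assumed to be a CSS selector that must appear before capture.
        headers["x-wait-for-selector"] = wait_for_selector
    return Request("https://r.jina.ai/" + target_url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` to send the request; only the headers that were explicitly requested are attached.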

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

  • 397 stars in the last 90 days
