reader by jina-ai

LLM input converter via URL

Created 1 year ago

9,656 stars

Top 5.2% on SourcePulse

8 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

and 4 more!

Project Summary

This project provides a free, stable, and scalable API service for converting web content into a format suitable for Large Language Models (LLMs) and for performing web searches. It targets developers building LLM-powered agents and RAG systems, offering enhanced input quality and access to real-time information.

How It Works

The service operates via two main endpoints: r.jina.ai for content retrieval and s.jina.ai for web search. r.jina.ai fetches content from any URL, processing it for LLM consumption, including handling JavaScript-heavy Single Page Applications (SPAs) via Puppeteer and headless Chrome. s.jina.ai performs web searches, retrieves the top 5 results, and then applies the r.jina.ai processing to each, providing richer context than typical search engine API snippets.
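The search flow described above (search, keep the top 5 results, then run each through the reader step) can be sketched as a small pipeline. This is an illustrative sketch only, not the service's actual implementation: `search_with_content`, `search`, `read`, and `top_k` are hypothetical names, and the two backends are passed in as plain callables so the example runs without network access.

```python
from typing import Callable, Dict, List


def search_with_content(
    query: str,
    search: Callable[[str], List[str]],
    read: Callable[[str], str],
    top_k: int = 5,
) -> List[Dict[str, str]]:
    """Search, keep the first top_k result URLs, and read each one in full.

    `search` and `read` are hypothetical stand-ins for the search step and
    the r.jina.ai-style content step; they are assumptions of this sketch.
    """
    urls = search(query)[:top_k]
    # Each result carries the full processed content, not a short snippet.
    return [{"url": url, "content": read(url)} for url in urls]


if __name__ == "__main__":
    # Stand-in backends so the sketch runs offline.
    def fake_search(query: str) -> List[str]:
        return [f"https://example.com/{i}" for i in range(8)]

    def fake_read(url: str) -> str:
        return f"cleaned text of {url}"

    for item in search_with_content("llm agents", fake_search, fake_read):
        print(item["url"], "->", item["content"])
```

Passing the backends in as arguments keeps the sketch independent of any particular HTTP client; in real use they would wrap calls to the two endpoints.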

Quick Start & Requirements

Installation: Clone the repository (git clone git@github.com:jina-ai/reader.git) and run npm install.
Prerequisites: Node.js v18 (the build is reported to fail on later versions).
Usage: Access via https://r.jina.ai/<your_url> for content reading or https://s.jina.ai/<your_query> for web search.
Demo: Live demo available at the project's GitHub repository.
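As a rough illustration of the usage pattern above, the two request URLs can be assembled with simple string handling. The helper names `reader_url` and `search_url` are hypothetical, and percent-encoding the search query is an assumption of this sketch; target URLs that carry their own query string or fragment may need extra care that is not handled here.

```python
from urllib.parse import quote

READER_BASE = "https://r.jina.ai/"
SEARCH_BASE = "https://s.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prefix the target URL with the reader endpoint."""
    return READER_BASE + target_url


def search_url(query: str) -> str:
    """Percent-encode a free-text query and append it to the search endpoint.

    Encoding the query is an assumption of this sketch.
    """
    return SEARCH_BASE + quote(query, safe="")


if __name__ == "__main__":
    print(reader_url("https://example.com/article"))
    print(search_url("open source web readers"))
```

The resulting strings can be passed to any HTTP client; the request itself is left out so the example stays runnable offline.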

Highlighted Details

Supports fetching content from Single Page Applications (SPAs) using headless Chrome.
Offers advanced control via request headers for features like image captioning, cookie forwarding, proxying, and content selection.
s.jina.ai provides full content of top search results, not just snippets.
Adaptive crawler introduced for recursive website crawling and relevant page extraction.
PDF and image content extraction capabilities are supported.

Maintenance & Community

The service is actively maintained by Jina AI as a core product. Updates are deployed directly from commits to this repository. Users can report issues with specific URLs.

Licensing & Compatibility

Licensed under Apache-2.0. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

According to the project, the build fails on Node.js versions above v18. The hosted API is generally stable; past DDoS attacks have been noted, and reliability has since been improved. Some SPAs may need request headers such as x-timeout or x-wait-for-selector for their content to be captured fully.
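A minimal sketch of attaching the two headers named above to a reader request. Only the header names come from the text; the helper names `spa_headers` and `build_request`, the treatment of the timeout as a whole number of seconds, and the example selector `#content` are assumptions for illustration. The request is built but deliberately not sent.

```python
import urllib.request
from typing import Dict, Optional


def spa_headers(
    timeout_seconds: Optional[int] = None,
    wait_for_selector: Optional[str] = None,
) -> Dict[str, str]:
    """Build optional headers for pages that render content late."""
    headers: Dict[str, str] = {}
    if timeout_seconds is not None:
        if timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        # Assumption: the value is a number of seconds sent as a string.
        headers["x-timeout"] = str(timeout_seconds)
    if wait_for_selector:
        # Assumption: the value is a CSS selector to wait for.
        headers["x-wait-for-selector"] = wait_for_selector
    return headers


def build_request(target_url: str, headers: Dict[str, str]) -> urllib.request.Request:
    """Create (but do not send) a GET request against the reader endpoint."""
    return urllib.request.Request("https://r.jina.ai/" + target_url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://example.com/app", spa_headers(30, "#content"))
    print(req.full_url)
    print(req.header_items())
```

Sending the request (for example with `urllib.request.urlopen(req)`) is omitted so the sketch has no network dependency.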

Health Check

Last Commit: 8 months ago
Responsiveness: 1 week
Star History: 195 stars in the last 30 days