kimuraframework by vifreefly

Ruby web scraping framework for JS-rendered sites

created 7 years ago
1,012 stars

Top 37.6% on sourcepulse

1 Expert Loves This Project
Project Summary

Kimurai is a Ruby web scraping framework for extracting data from websites, including those that render their content with JavaScript. It targets developers who want a scraping tool with a familiar API built on Capybara and Nokogiri.

How It Works

Kimurai leverages various "engines" for fetching and rendering web pages: Mechanize for simple HTTP requests, Poltergeist (PhantomJS) for JavaScript rendering, and Selenium for Headless Chrome or Firefox. This engine abstraction allows users to switch rendering backends without rewriting their spider logic. The framework provides a Capybara-like interface for interacting with pages (e.g., clicking buttons, filling forms) and a structured approach to defining spiders, requests, and data parsing.
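The engine abstraction described above can be illustrated with a small plain-Ruby sketch. This is not Kimurai's implementation or API: `StaticEngine`, `RenderingEngine`, `ENGINES`, and `TitleSpider` are hypothetical names, and both engines return canned HTML instead of making real requests. The sketch only shows why spider logic that depends on a shared `fetch` interface does not change when the backend is swapped.

```ruby
# Hypothetical sketch of an engine abstraction (not Kimurai's internals).
# Both engines expose the same #fetch(url) interface and return canned HTML.

class StaticEngine
  # Stands in for a plain HTTP client that returns HTML as served.
  def fetch(url)
    "<html><head><title>Static #{url}</title></head></html>"
  end
end

class RenderingEngine
  # Stands in for a headless browser that returns HTML after JavaScript ran.
  def fetch(url)
    "<html><head><title>Rendered #{url}</title></head></html>"
  end
end

# Registry that maps an engine name to its class.
ENGINES = {
  static: StaticEngine,
  rendering: RenderingEngine
}.freeze

class TitleSpider
  # Raises KeyError if the engine name is not registered.
  def initialize(engine)
    @engine = ENGINES.fetch(engine).new
  end

  # The parsing logic relies only on #fetch, so it is identical for every engine.
  def title(url)
    @engine.fetch(url)[%r{<title>(.*?)</title>}m, 1]
  end
end

puts TitleSpider.new(:static).title("https://example.com")
puts TitleSpider.new(:rendering).title("https://example.com")
```

Switching backends in this sketch means changing a single symbol passed to the constructor; the `title` method stays untouched.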

Quick Start & Requirements

  • Installation: gem install kimurai
  • Prerequisites: Ruby >= 2.5.0. The Selenium engines require a browser (Chrome or Firefox) and its matching driver (ChromeDriver or GeckoDriver). The Poltergeist engine requires PhantomJS.
  • Setup: The kimurai setup command automates environment setup on Ubuntu 18.04 using Ansible.
  • Documentation: https://github.com/vifreefly/kimuraframework

Highlighted Details

  • Supports headless Chrome, Firefox, PhantomJS, and basic HTTP requests via Mechanize.
  • Built-in features for handling request errors, retries, and memory limits.
  • Includes helpers for saving data (JSON, CSV), skipping duplicate URLs, and parallel scraping.
  • Offers a project mode similar to Scrapy, with generators for spiders, projects, and scheduling.
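Three of the features listed above (skipping duplicate URLs, retrying failed requests, and saving items as JSON) can be sketched in plain Ruby using only the standard library. The helper names `UniqueFilter`, `with_retries`, and `save_items` are assumptions made for illustration and should not be read as Kimurai's own helpers or signatures.

```ruby
require "json"
require "set"
require "stringio"

# Hypothetical helper: remembers values per scope and reports whether a value is new.
class UniqueFilter
  def initialize
    @seen = Hash.new { |hash, scope| hash[scope] = Set.new }
  end

  # Returns true the first time a value is seen within a scope, false afterwards.
  def unique?(scope, value)
    !@seen[scope].add?(value).nil?
  end
end

# Hypothetical helper: re-runs the block after an error, up to max_attempts in total.
def with_retries(max_attempts:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    retry if attempts < max_attempts
    raise
  end
end

# Hypothetical helper: writes one JSON object per line to any IO-like object.
def save_items(io, items)
  items.each { |item| io.puts(JSON.generate(item)) }
end

filter = UniqueFilter.new
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]

# Keep only URLs that have not been seen before.
fresh = urls.select { |url| filter.unique?(:url, url) }

# Simulate a request that fails once before succeeding.
items = fresh.map do |url|
  with_retries(max_attempts: 3) do |attempt|
    raise IOError, "simulated network error" if attempt == 1
    { "url" => url, "attempts" => attempt }
  end
end

output = StringIO.new
save_items(output, items)
puts output.string
```

Writing to an IO-like object keeps the sketch free of file-system side effects; replacing the `StringIO` with `File.open("items.jsonl", "a")` would persist the same lines to disk.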

Maintenance & Community

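The project is maintained by vifreefly, though the last commit was about a year ago and there has been no pull request or issue activity in the past 30 days. Community support is available via chat.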

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

While Selenium engines offer robust JavaScript rendering, they can be more resource-intensive than Mechanize. The README notes that Selenium drivers do not support proxies with authorization. The kimurai setup command currently only supports Ubuntu 18.04.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 90 days

Explore Similar Projects

Starred by Clément Renault (Cofounder of Meilisearch), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 1 more.

browser by lightpanda-io

0.4%
9k
Headless browser for AI/automation tasks, scraping, and LLM training
created 2 years ago
updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

1.9%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 1 day ago