kimuraframework by vifreefly

Ruby web scraping framework for JS-rendered sites

Created 7 years ago

1,063 stars

Top 35.6% on SourcePulse

1 Expert Loves This Project

Hiroshi Shibata

Core Contributor to Ruby

Project Summary

Kimurai is a Ruby web scraping framework for extracting data from websites, including those that render their content with JavaScript. It targets developers who want a scraping tool with a familiar API built on Capybara and Nokogiri.

How It Works

Kimurai leverages various "engines" for fetching and rendering web pages: Mechanize for simple HTTP requests, Poltergeist (PhantomJS) for JavaScript rendering, and Selenium for Headless Chrome or Firefox. This engine abstraction allows users to switch rendering backends without rewriting their spider logic. The framework provides a Capybara-like interface for interacting with pages (e.g., clicking buttons, filling forms) and a structured approach to defining spiders, requests, and data parsing.
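
The engine abstraction described above can be pictured with a toy sketch in plain Ruby. This is not Kimurai's implementation and does not use the gem: FakeHttpEngine, FakeBrowserEngine, ENGINES, and TitleSpider are hypothetical names invented for illustration, and both engines return canned HTML instead of touching the network. The only point is that the parsing logic stays the same when the engine symbol changes.

```ruby
# Toy model of an engine abstraction (hypothetical; not Kimurai's code).

# Stands in for a plain HTTP client that does not execute JavaScript.
class FakeHttpEngine
  def fetch(url)
    "<html><title>Static page</title></html>"
  end
end

# Stands in for a headless browser that returns HTML after rendering.
class FakeBrowserEngine
  def fetch(url)
    "<html><title>Rendered page</title></html>"
  end
end

# Maps an engine symbol to the class that implements it.
ENGINES = {
  http: FakeHttpEngine,
  browser: FakeBrowserEngine
}.freeze

class TitleSpider
  def initialize(engine:)
    # Raises KeyError for an unknown engine symbol.
    @engine = ENGINES.fetch(engine).new
  end

  # The extraction logic never inspects which engine produced the HTML.
  def crawl(url)
    html = @engine.fetch(url)
    html[%r{<title>(.*?)</title>}, 1]
  end
end

puts TitleSpider.new(engine: :http).crawl("https://example.com")
puts TitleSpider.new(engine: :browser).crawl("https://example.com")
```

In this sketch, switching backends is a one-symbol change at construction time, which is the property the paragraph above attributes to Kimurai's engines.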

Quick Start & Requirements

Installation: gem install kimurai
Prerequisites: Ruby >= 2.5.0. The Selenium engines require a browser (Chrome or Firefox) and its matching driver (ChromeDriver or GeckoDriver). The Poltergeist engine requires PhantomJS.
Setup: The kimurai setup command automates environment setup on Ubuntu 18.04 using Ansible.
Documentation: https://github.com/vifreefly/kimuraframework

Highlighted Details

Supports headless Chrome, Firefox, PhantomJS, and basic HTTP requests via Mechanize.
Built-in features for handling request errors, retries, and memory limits.
Includes helpers for saving data (JSON, CSV), skipping duplicate URLs, and parallel scraping.
Offers a project mode similar to Scrapy, with generators for spiders, projects, and scheduling.
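
As a rough illustration of what duplicate skipping and retry handling involve, here is a minimal sketch using only Ruby's standard library. It does not call Kimurai: SeenUrls, first_visit?, and with_retries are hypothetical names made up for this example, and the failure is simulated rather than coming from a real request.

```ruby
require "set"
require "json"

# Hypothetical helper that remembers which URLs were already visited.
class SeenUrls
  def initialize
    @seen = Set.new
  end

  # True the first time a URL is offered, false on every repeat.
  # Set#add? returns nil when the element is already present.
  def first_visit?(url)
    !@seen.add?(url).nil?
  end
end

# Hypothetical retry wrapper: re-runs the block until it succeeds
# or max_attempts is reached, then re-raises the last error.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    retry if attempts < max_attempts
    raise
  end
end

seen = SeenUrls.new
urls = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/a" # duplicate, should be skipped
]
fresh = urls.select { |u| seen.first_visit?(u) }

# Simulate a request that fails twice before succeeding.
result = with_retries(max_attempts: 3) do |attempt|
  raise IOError, "simulated failure" if attempt < 3
  "ok on attempt #{attempt}"
end

# Serialize the collected data as JSON, one of the formats listed above.
puts JSON.generate({ urls: fresh, result: result })
```

A real spider would attach this kind of bookkeeping to each request. The sketch keeps it separate so the two ideas are visible in isolation.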

Maintenance & Community

The project is actively maintained by vifreefly. Community support is available via chat.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

While Selenium engines offer robust JavaScript rendering, they can be more resource-intensive than Mechanize. The README notes that Selenium drivers do not support proxies with authorization. The kimurai setup command currently only supports Ubuntu 18.04.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive

Star History

50 stars in the last 30 days