kimuraframework by vifreefly

Ruby web scraping framework for JS-rendered sites

Created 7 years ago
1,011 stars

Top 37.0% on SourcePulse

Project Summary

Kimurai is a Ruby web scraping framework for extracting data from websites, including those whose content is rendered by JavaScript. It targets developers who want one tool for both plain-HTTP and browser-driven scraping, and its API builds on two familiar libraries: Capybara for page interaction and Nokogiri for HTML parsing.

How It Works

Kimurai leverages various "engines" for fetching and rendering web pages: Mechanize for simple HTTP requests, Poltergeist (PhantomJS) for JavaScript rendering, and Selenium for Headless Chrome or Firefox. This engine abstraction allows users to switch rendering backends without rewriting their spider logic. The framework provides a Capybara-like interface for interacting with pages (e.g., clicking buttons, filling forms) and a structured approach to defining spiders, requests, and data parsing.
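
The engine abstraction described above can be pictured as a small strategy pattern. The sketch below is a toy model in plain Ruby, not Kimurai's code: PlainHttpEngine, HeadlessBrowserEngine, ToySpider, and the ENGINES table are hypothetical names, and the engines return canned strings instead of fetching anything. It only illustrates why spider logic can stay unchanged when the engine is swapped.

```ruby
# Toy model of an engine abstraction. None of these names come from Kimurai.

# Hypothetical engine standing in for a plain HTTP client.
class PlainHttpEngine
  def fetch(url)
    "<html><!-- #{url} fetched without running JavaScript --></html>"
  end
end

# Hypothetical engine standing in for a headless browser.
class HeadlessBrowserEngine
  def fetch(url)
    "<html><!-- #{url} fetched after JavaScript rendering --></html>"
  end
end

# Lookup table from an engine name to the class that implements it.
ENGINES = {
  plain_http: PlainHttpEngine,
  headless_browser: HeadlessBrowserEngine
}.freeze

class ToySpider
  # Raises KeyError if the engine name is not in ENGINES.
  def initialize(engine:)
    @engine = ENGINES.fetch(engine).new
  end

  # Spider logic: the same code runs whichever engine was selected.
  def describe_fetch(url)
    html = @engine.fetch(url)
    html[/<!-- (.*?) -->/, 1]
  end
end

spider = ToySpider.new(engine: :plain_http)
puts spider.describe_fetch("https://example.com")
```

Switching the keyword argument to `engine: :headless_browser` changes how the page is obtained while `describe_fetch` stays untouched, which is the property the paragraph above attributes to Kimurai's engines.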

Quick Start & Requirements

  • Installation: gem install kimurai
  • Prerequisites: Ruby >= 2.5.0. The Selenium engines require a browser (Chrome or Firefox) and its matching driver (ChromeDriver or GeckoDriver); the Poltergeist engine requires PhantomJS.
  • Setup: The kimurai setup command automates environment setup on Ubuntu 18.04 using Ansible.
  • Documentation: https://github.com/vifreefly/kimuraframework
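
Before installing, the prerequisites listed above can be checked with a few lines of plain Ruby. This is a minimal sketch, not part of Kimurai: version_at_least? and on_path? are hypothetical helper names, and the binary names chromedriver, geckodriver, and phantomjs are assumed to be the names the drivers are installed under on your system.

```ruby
# Hypothetical pre-install check. Uses only Ruby's standard facilities.

# True when `current` is the same as or newer than `required`.
def version_at_least?(current, required)
  Gem::Version.new(current) >= Gem::Version.new(required)
end

# True when an executable file called `name` exists in a PATH directory.
def on_path?(name)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

puts "Ruby #{RUBY_VERSION} meets >= 2.5.0: #{version_at_least?(RUBY_VERSION, '2.5.0')}"

# Assumed binary names; adjust if your installation uses different ones.
%w[chromedriver geckodriver phantomjs].each do |binary|
  puts "#{binary} found on PATH: #{on_path?(binary)}"
end
```

Only the drivers for the engines you plan to use need to be present; a spider that sticks to Mechanize needs none of them.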

Highlighted Details

  • Supports headless Chrome, Firefox, PhantomJS, and basic HTTP requests via Mechanize.
  • Built-in features for handling request errors, retries, and memory limits.
  • Includes helpers for saving data (JSON, CSV), skipping duplicate URLs, and parallel scraping.
  • Offers a project mode similar to Scrapy, with generators for spiders, projects, and scheduling.
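
Two of the features listed above, skipping duplicate URLs and retrying failed requests, follow well-known patterns. The sketch below shows those patterns in plain Ruby under stated assumptions: SeenRegistry and with_retries are hypothetical names written for this illustration and are not Kimurai's implementation, and the failing "request" is simulated rather than sent over the network.

```ruby
require "set"

# Hypothetical helper: remembers values per scope and reports first sightings.
class SeenRegistry
  def initialize
    @seen = Hash.new { |hash, scope| hash[scope] = Set.new }
  end

  # True the first time `value` is seen within `scope`, false afterwards.
  def unique?(scope, value)
    !@seen[scope].add?(value).nil?
  end
end

# Hypothetical helper: runs the block, retrying on the listed error classes
# until it succeeds or `attempts` tries have been made. The block receives
# the current attempt number, starting at 1.
def with_retries(attempts:, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    retry if tries < attempts
    raise
  end
end

registry = SeenRegistry.new
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]

# Keep only the first occurrence of each URL.
to_visit = urls.select { |url| registry.unique?(:page_urls, url) }

to_visit.each do |url|
  # Simulated request that fails on the first attempt and succeeds on the second.
  body = with_retries(attempts: 3) do |attempt|
    raise IOError, "simulated timeout" if attempt < 2
    "response for #{url}"
  end
  puts body
end
```

Scoping the registry by a symbol such as `:page_urls` lets one spider track several kinds of values (for example listing pages and product pages) without their entries colliding.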

Maintenance & Community

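The repository is maintained by vifreefly, but the health check below reports the last commit as 1 year ago and responsiveness as inactive, so active maintenance should not be assumed. Community support is available via chat.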

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The Selenium engines render JavaScript in a full browser, so they use more memory and CPU than Mechanize. The README notes that Selenium drivers do not support proxies with authorization. The kimurai setup command supports only Ubuntu 18.04.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
