strip-tags by simonw

CLI tool for stripping tags from HTML

created 2 years ago
339 stars

Top 82.4% on sourcepulse

Project Summary

This tool provides a command-line interface and Python library for stripping HTML tags, offering granular control over which tags and attributes are preserved. It's designed for developers and researchers working with HTML content, particularly when preparing text for language models or cleaning web-scraped data.

How It Works

The tool parses the HTML and, by default, strips every tag and leaves only the text. CSS selectors passed as arguments restrict the output to the matching areas of the page, and the -r/--remove option drops the content of matching areas entirely. Selected tags can be kept, along with a small set of their attributes (such as id, class, href, and alt), so that semantic hints survive for downstream processing such as feeding data into LLMs. Whitespace minification and blank line removal further reduce the size of the output.
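
The keep-tags idea described above can be sketched with the Python standard library alone. This is an illustrative approximation, not the project's implementation: the names TagStripper, strip_html, and KEPT_ATTRS are hypothetical, and the attribute list is simply the one mentioned in the text.

```python
from html import escape
from html.parser import HTMLParser

# Attributes retained on kept tags in this sketch (assumption: the
# list named in the text above; the real tool may keep others).
KEPT_ATTRS = {"id", "class", "href", "alt"}


class TagStripper(HTMLParser):
    """Collect text content, re-emitting only the tags in keep_tags."""

    def __init__(self, keep_tags=()):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            # Drop every attribute except the allow-listed ones.
            kept = "".join(
                f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name in KEPT_ATTRS and value is not None
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept; only the markup around it is dropped.
        self.parts.append(data)


def strip_html(html, keep_tags=()):
    """Return html with all tags removed except those in keep_tags."""
    parser = TagStripper(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div><h1 id="t" style="x">Title</h1><p>Body <b>text</b></p></div>'
print(strip_html(sample, keep_tags=["h1"]))
# prints: <h1 id="t">Title</h1>Body text
```

The sketch has no CSS selector support and does not special-case script or style content; it only shows how stripping markup while preserving a tag allow-list and an attribute allow-list might fit together.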

Quick Start & Requirements

  • Install via pip: pip install strip-tags
  • Usage: cat input.html | strip-tags > output.txt
  • More info: strip-tags documentation
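
Since the package is also described as a Python library, the same operation can be called from code. The sketch below assumes the package exposes a strip_tags function that accepts a list of CSS selectors plus minify and keep_tags keyword arguments, as in the project's README; check the installed version before relying on it. The wrapper name clean_for_llm is hypothetical.

```python
def clean_for_llm(html):
    """Strip tags from the <div> areas of html, keeping <h1> headings.

    Assumes `pip install strip-tags` has been run and that the library
    provides strip_tags(html, selectors, minify=..., keep_tags=...).
    """
    # Imported lazily so this module still loads where the package
    # is not installed.
    from strip_tags import strip_tags

    return strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
```

Calling clean_for_llm(page_source) would then return the stripped text as a string, roughly equivalent to piping the page through the command-line tool with matching options.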

Highlighted Details

  • Supports CSS selectors for targeted stripping and removal.
  • Options to keep specific tags (-t, --keep-tag) and to preserve all attributes on kept tags (--all-attrs).
  • Includes predefined tag bundles for common use cases (headings, metadata, structure, tables, lists).
  • Offers whitespace minification (-m, --minify) and blank line removal.
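
As a rough illustration of what whitespace minification can involve, the following standard-library sketch collapses runs of spaces and limits consecutive blank lines. The exact rules applied by -m/--minify may differ; minify_whitespace is a hypothetical helper, not part of the tool.

```python
import re


def minify_whitespace(text):
    """Collapse redundant whitespace (an approximation, not the tool's rules)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove the single space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Allow at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim leading and trailing whitespace from the whole document.
    return text.strip()


print(repr(minify_whitespace("  Title  \n\n\n\n  Body   text \n")))
# prints: 'Title\n\nBody text'
```

Shrinking whitespace in this way reduces the amount of text, and therefore the number of tokens, passed to a language model without changing the visible content.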

Maintenance & Community

This project is part of a suite of tools by Simon Willison, known for his work on Datasette and LLM-related utilities.

Licensing & Compatibility

The project is released under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The tool focuses on tag stripping and attribute management; it does not perform full HTML validation or repair. Complex or malformed HTML might yield unexpected results.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 90 days
