CLI tool for stripping tags from HTML
This tool provides a command-line interface and Python library for stripping HTML tags, offering granular control over which tags and attributes are preserved. It's designed for developers and researchers working with HTML content, particularly when preparing text for language models or cleaning web-scraped data.
How It Works
The tool parses HTML and lets users pass CSS selectors to choose which parts of the document are stripped or removed. It can keep selected tags and a small set of their attributes (such as id, class, href, and alt) so that semantic hints survive for downstream processing, for example when feeding the text to an LLM. Options for whitespace minification and blank line removal further reduce the output.
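The idea of stripping tags while keeping a chosen few can be sketched with Python's standard library alone. This is an illustration of the technique, not the tool's own implementation: the names TagStripper, strip_html, and KEPT_ATTRS are hypothetical, CSS selectors are not handled, and script or style contents are passed through as ordinary text.

```python
import re
from html import escape
from html.parser import HTMLParser

# Attributes retained on kept tags (assumed from the id/class/href/alt list above).
KEPT_ATTRS = ("id", "class", "href", "alt")


class TagStripper(HTMLParser):
    """Collect text content, re-emitting only the tags named in keep_tags."""

    def __init__(self, keep_tags=()):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            # Keep only the allow-listed attributes; drop everything else.
            kept = "".join(
                f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name in KEPT_ATTRS and value is not None
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html, keep_tags=(), minify=False):
    """Return html with tags removed, except those listed in keep_tags."""
    parser = TagStripper(keep_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    if minify:
        text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
        text = re.sub(r"\s*\n\s*", "\n", text)  # drop blank lines and line-edge padding
        text = text.strip()
    return text


if __name__ == "__main__":
    sample = '<div><h1 id="t" style="color:red">Title</h1>\n\n<p>Hello   world</p></div>'
    print(strip_html(sample, keep_tags=["h1"], minify=True))
```

Using an allow-list for attributes means presentational ones such as style are dropped even on kept tags, which is the behaviour the paragraph above describes as retaining semantic hints only.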
Quick Start & Requirements
pip install strip-tags
cat input.html | strip-tags > output.txt
Highlighted Details
Tag preservation (-t, --keep-tag) and attribute preservation (--all-attrs).
Whitespace minification (-m, --minify) and blank line removal.
Maintenance & Community
This project is part of a suite of tools by Simon Willison, known for his work on Datasette and LLM-related utilities.
Licensing & Compatibility
The project is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The tool focuses on tag stripping and attribute management; it does not perform full HTML validation or repair. Complex or malformed HTML might yield unexpected results.