strip-tags  by simonw

CLI tool for stripping tags from HTML

Created 2 years ago
344 stars

Top 80.4% on SourcePulse

Project Summary

This tool provides a command-line interface and Python library for stripping HTML tags, offering granular control over which tags and attributes are preserved. It's designed for developers and researchers working with HTML content, particularly when preparing text for language models or cleaning web-scraped data.

How It Works

The tool parses HTML and accepts CSS selectors that either limit the output to matching regions of the page or remove matching elements before stripping. It can keep selected tags, along with attributes such as id, class, href, and alt, so that downstream consumers such as LLMs retain semantic hints. Options for whitespace minification and blank-line removal further refine the output.
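To make the keep-tags idea concrete, here is a minimal sketch built only on Python's standard-library html.parser. It is not the strip-tags implementation; the names TagStripper, strip_html, and KEPT_ATTRS are hypothetical, and the attribute list is simply the one mentioned above. CSS selector support is left out.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: attributes worth preserving on kept tags (taken from the summary text).
KEPT_ATTRS = {"id", "class", "href", "alt"}


class TagStripper(HTMLParser):
    """Collect text content, re-emitting only the tags listed in keep_tags."""

    def __init__(self, keep_tags=()):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tags that are not explicitly kept are dropped; their text still passes through.
        if tag in self.keep_tags:
            kept = "".join(
                f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name in KEPT_ATTRS and value is not None
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup, keep_tags=()):
    """Return markup with every tag removed except those in keep_tags."""
    parser = TagStripper(keep_tags)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div><h1 id="top" style="color:red">Title</h1>'
    '<p>See <a href="/docs">docs</a>.</p></div>'
)
print(strip_html(sample))                          # TitleSee docs.
print(strip_html(sample, keep_tags=["h1", "a"]))   # <h1 id="top">Title</h1>See <a href="/docs">docs</a>.
```

In the second call the style attribute is dropped while id and href survive, which is the kind of semantic hint the summary describes.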

Quick Start & Requirements

  • Install via pip: pip install strip-tags
  • Usage: cat input.html | strip-tags > output.txt
  • More info: strip-tags documentation

Highlighted Details

  • Supports CSS selectors for targeted stripping and removal.
  • Options to keep specific tags (-t, --keep-tag) and to preserve all of their attributes (--all-attrs).
  • Includes predefined tag bundles for common use cases (headings, metadata, structure, tables, lists).
  • Offers whitespace minification (-m, --minify) and blank line removal.
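As a rough illustration of what whitespace minification and blank-line removal could look like, the sketch below uses regular expressions from the standard library. The exact rules strip-tags applies may differ; minify_whitespace and remove_blank_lines are hypothetical helper names, and the "at most one blank line" rule is an assumption.

```python
import re


def minify_whitespace(text):
    """Collapse horizontal whitespace and limit consecutive blank lines to one."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left on either side of a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Three or more newlines become two (assumption: keep at most one blank line).
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def remove_blank_lines(text):
    """Drop every line that is empty or contains only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


raw = "Title \t  here\n\n\n\n  Body   text \n"
compact = minify_whitespace(raw)
print(repr(compact))                      # 'Title here\n\nBody text'
print(repr(remove_blank_lines(compact)))  # 'Title here\nBody text'
```

Keeping the two steps separate mirrors the summary, which lists minification and blank-line removal as distinct options.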

Maintenance & Community

This project is part of a suite of tools by Simon Willison, known for his work on Datasette and LLM-related utilities.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The tool focuses on tag stripping and attribute management; it does not perform full HTML validation or repair. Complex or malformed HTML might yield unexpected results.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
