dom-to-semantic-markdown by romansky

CLI tool for semantic Markdown conversion optimized for LLMs

created 1 year ago
885 stars

Top 41.6% on sourcepulse

Project Summary

This library converts HTML DOM to a semantic Markdown format optimized for Large Language Models (LLMs). It preserves semantic structure, extracts metadata, and reduces token usage, making web content easier for LLMs to process. The target audience includes developers and researchers working with LLMs who need to ingest and analyze web data.

How It Works

The library parses HTML, identifies semantic elements (such as headers, footers, and nav), extracts metadata (title, OG tags, JSON-LD), and detects the main content. To keep the token count low, it uses a concise output representation and URL refification, that is, replacing long URLs with short reference IDs. Table columns are given unique identifiers so an LLM can correlate each cell with its column.
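The idea of column identifiers can be sketched as follows. This is an illustration only, not the library's implementation: the function name tableToMarkdown and the "col-N" HTML-comment format are assumptions chosen for the example, and the library's actual output may differ.

```typescript
// Illustrative sketch only. The "col-N" comment format is an assumption,
// not necessarily what dom-to-semantic-markdown emits.
function tableToMarkdown(headers: string[], rows: string[][]): string {
  // Append the column id to every cell so each value can be tied to its column.
  const tag = (cell: string, col: number): string => `${cell} <!-- col-${col} -->`;
  const line = (cells: string[]): string => `| ${cells.map(tag).join(" | ")} |`;
  const separator = `| ${headers.map(() => "---").join(" | ")} |`;
  return [line(headers), separator, ...rows.map(line)].join("\n");
}

const example = tableToMarkdown(["Name", "Price"], [["Tea", "3"], ["Coffee", "4"]]);
console.log(example);
```

Tagging every cell, not only the header, costs a few tokens per cell but lets a model match a value to its column even when it sees only part of a long table.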

Quick Start & Requirements

  • CLI: npx d2m@latest -u <URL>
  • Browser: import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; convertHtmlToMarkdown(document.body);
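  • Node.js (with jsdom supplying the DOM parser): import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; import {JSDOM} from 'jsdom'; const dom = new JSDOM(htmlString); convertHtmlToMarkdown(htmlString, { overrideDOMParser: new dom.window.DOMParser() });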
  • Dependencies: Node.js environment for CLI/server-side usage.
  • Docs: https://github.com/romansky/dom-to-semantic-markdown

Highlighted Details

  • Semantic structure preservation (headers, footers, nav).
  • Metadata extraction (title, description, keywords, OG, Twitter Cards, JSON-LD).
  • Main content detection.
  • Table column tracking for LLM data correlation.
  • URL refification for token efficiency.
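To make the token-saving idea behind URL refification concrete, here is a minimal sketch, assuming it means swapping each distinct link target for a short reference ID and keeping a lookup map. The function name refifyLinks and the "refN" ID format are hypothetical and do not describe the library's own API.

```typescript
// Illustrative sketch only, not the library's implementation.
// Replaces each distinct URL in inline Markdown links with a short id
// and returns the id -> URL map so the originals can be recovered.
function refifyLinks(markdown: string): { text: string; refs: Record<string, string> } {
  const refs: Record<string, string> = {};
  const idByUrl = new Map<string, string>();
  // Matches the "](http...)" target part of an inline Markdown link.
  const text = markdown.replace(
    /\]\((https?:\/\/[^)\s]+)\)/g,
    (_match: string, url: string): string => {
      let id = idByUrl.get(url);
      if (id === undefined) {
        // Hypothetical id scheme: ref0, ref1, ... in order of first appearance.
        id = `ref${idByUrl.size}`;
        idByUrl.set(url, id);
        refs[id] = url;
      }
      return `](${id})`;
    }
  );
  return { text, refs };
}

const sample =
  "[Docs](https://example.com/a/very/long/path) and " +
  "[again](https://example.com/a/very/long/path), [other](https://example.org/x)";
const result = refifyLinks(sample);
console.log(result.text);
console.log(result.refs);
```

Repeated URLs share one ID, so a page that links to the same long address many times pays for it only once in the lookup map.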

Maintenance & Community

  • Active development indicated by recent commits.
  • Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Node.js has no built-in DOMParser, so server-side use requires supplying one through the overrideDOMParser option (for example, from jsdom). Custom element processing and handlers for unhandled elements are available, which makes the library extensible but also means non-standard HTML may need custom logic.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 65 stars in the last 90 days
