dom-to-semantic-markdown by romansky

CLI tool for semantic Markdown conversion optimized for LLMs

Created 1 year ago

948 stars

Top 38.6% on SourcePulse

3 Experts Love This Project

Tobi Lutke

Cofounder of Shopify

John Resig

Author of jQuery; Chief Software Architect at Khan Academy

Jack Lukic

Author of Semantic UI

Project Summary

This library converts HTML DOM to a semantic Markdown format optimized for Large Language Models (LLMs). It preserves semantic structure, extracts metadata, and reduces token usage, making web content easier for LLMs to process. The target audience includes developers and researchers working with LLMs who need to ingest and analyze web data.

How It Works

The library parses HTML, identifies semantic elements (such as headers, footers, and nav), extracts metadata (title, OG tags, JSON-LD), and detects the main content. To reduce token count, it emits a concise representation and applies URL refification, which replaces long URLs with short references. Table columns are given unique identifiers so that an LLM can correlate each cell with its column.
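
As a rough illustration of the URL refification idea, the sketch below rewrites inline Markdown link targets to short reference tokens and keeps a lookup table of the originals. The function name `refifyUrls`, the `refN` token format, and the regular expression are all assumptions made for this example; this is not the library's implementation, and the library's actual output format may differ.

```typescript
// Illustrative sketch only: NOT the library's actual implementation.
// Replaces each distinct URL used as an inline link target with a short
// reference token, and returns a table mapping tokens back to URLs.

interface RefifyResult {
  markdown: string;
  refs: Record<string, string>; // short ref -> original URL
}

function refifyUrls(markdown: string): RefifyResult {
  const urlToRef: Record<string, string> = {};
  const refs: Record<string, string> = {};
  let nextId = 0;

  // Matches the target part of inline links and images: ](http...)
  const linkTarget = /\]\((https?:\/\/[^)\s]+)\)/g;

  const rewritten = markdown.replace(linkTarget, (_match: string, url: string) => {
    // Reuse the existing token when the same URL appears again.
    if (!(url in urlToRef)) {
      const ref = "ref" + nextId++;
      urlToRef[url] = ref;
      refs[ref] = url;
    }
    return "](" + urlToRef[url] + ")";
  });

  return { markdown: rewritten, refs };
}

const sampleInput =
  "[Docs](https://example.com/a/very/long/path?x=1) and " +
  "[again](https://example.com/a/very/long/path?x=1), plus " +
  "[other](https://example.com/other)";

const sampleResult = refifyUrls(sampleInput);
console.log(sampleResult.markdown);
console.log(JSON.stringify(sampleResult.refs, null, 2));
```

Repeated URLs share one token, so the saving grows with the number of times a long URL recurs on the page.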

Quick Start & Requirements

CLI: npx d2m@latest -u <URL>
Browser: import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; convertHtmlToMarkdown(document.body);
Node.js: import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; convertHtmlToMarkdown(htmlString, { overrideDOMParser: new dom.window.DOMParser() }); (here dom is an existing DOM instance, for example one created with a library such as jsdom)
Dependencies: Node.js environment for CLI/server-side usage.
Docs: https://github.com/romansky/dom-to-semantic-markdown

Highlighted Details

Semantic structure preservation (headers, footers, nav).
Metadata extraction (title, description, keywords, OG, Twitter Cards, JSON-LD).
Main content detection.
Table column tracking for LLM data correlation.
URL refification for token efficiency.
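
To make table column tracking concrete, here is a minimal sketch of the general idea: every cell carries the identifier of its column, so a model reading a long table can still tell which column a value belongs to. The function name `tableToMarkdownWithColumnIds` and the `<!-- col-N -->` comment format are assumptions for illustration and may not match what the library emits.

```typescript
// Illustrative sketch only: the comment-based ID format is an assumption,
// not necessarily what the library produces.

function tableToMarkdownWithColumnIds(headers: string[], rows: string[][]): string {
  // Tag a cell with the ID of the column it belongs to.
  const tag = (cell: string, col: number): string => cell + " <!-- col-" + col + " -->";

  // Render one table row, tagging each cell by its position.
  const line = (cells: string[]): string =>
    "| " + cells.map((cell, col) => tag(cell, col)).join(" | ") + " |";

  const separator = "| " + headers.map(() => "---").join(" | ") + " |";
  return [line(headers), separator].concat(rows.map((r) => line(r))).join("\n");
}

const sampleTable = tableToMarkdownWithColumnIds(
  ["Name", "Price"],
  [["Widget", "$4"], ["Gadget", "$9"]]
);
console.log(sampleTable);
```

The extra identifiers cost a few tokens per cell, which is the trade-off for making each value traceable to its header.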

Maintenance & Community

Development activity: the most recent commit was 7 months ago (see Health Check).
Contribution guidelines available in CONTRIBUTING.md.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Node.js has no built-in DOMParser, so server-side usage must supply one through the overrideDOMParser option. Non-standard HTML may need custom logic: the library exposes hooks for custom element processing and for handling otherwise unhandled elements.

Health Check

Last Commit

7 months ago

Responsiveness

1 day

Star History

14 stars in the last 30 days