CLI tool for semantic Markdown conversion optimized for LLMs
This library converts HTML DOM to a semantic Markdown format optimized for Large Language Models (LLMs). It preserves semantic structure, extracts metadata, and reduces token usage, making web content easier for LLMs to process. The target audience includes developers and researchers working with LLMs who need to ingest and analyze web data.
How It Works
The library parses HTML, identifies semantic elements (such as headers, footers, and nav), extracts metadata (title, OG tags, JSON-LD), and detects the main content. To minimize token count, it uses URL refification (replacing long, repeated URLs with short reference tokens) and a concise output representation. Table columns are given unique identifiers so that an LLM can correlate each cell with its column.
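To make the idea of URL refification concrete, here is a minimal sketch of the general technique. It is not the library's implementation: the function name refifyUrlsSketch, the ref0/ref1 token format, and the returned legend object are all assumptions made for illustration.

```javascript
// Illustrative sketch only: refifyUrlsSketch is a hypothetical helper,
// not part of dom-to-semantic-markdown, and the "refN" token format is assumed.
function refifyUrlsSketch(markdown) {
  const prefixToRef = new Map();
  // Match http(s) URLs up to whitespace or a closing parenthesis.
  const urlPattern = /https?:\/\/[^\s)]+/g;

  const text = markdown.replace(urlPattern, (url) => {
    const cut = url.lastIndexOf('/');
    // If the last slash belongs to "//" there is no path to split off.
    if (cut <= url.indexOf('//') + 1) return url;

    const prefix = url.slice(0, cut); // e.g. https://example.com/long/path
    const rest = url.slice(cut);      // e.g. /a.png
    if (!prefixToRef.has(prefix)) {
      prefixToRef.set(prefix, `ref${prefixToRef.size}`);
    }
    return `${prefixToRef.get(prefix)}${rest}`;
  });

  // Legend mapping each short token back to the prefix it replaced.
  const refs = Object.fromEntries(
    [...prefixToRef].map(([prefix, ref]) => [ref, prefix])
  );
  return { text, refs };
}

const sample =
  '![a](https://example.com/long/path/a.png) and ' +
  '[b](https://example.com/long/path/b.html)';
const result = refifyUrlsSketch(sample);
console.log(result.text); // ![a](ref0/a.png) and [b](ref0/b.html)
console.log(result.refs); // { ref0: 'https://example.com/long/path' }
```

The saving comes from repetition: a shared prefix is spelled out once in the legend, and every later occurrence costs only a short token.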
Quick Start & Requirements
CLI: npx d2m@latest -u <URL>
Browser: import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; const markdown = convertHtmlToMarkdown(document.body);
Node.js (with jsdom): import {JSDOM} from 'jsdom'; import {convertHtmlToMarkdown} from 'dom-to-semantic-markdown'; const dom = new JSDOM(htmlString); const markdown = convertHtmlToMarkdown(htmlString, { overrideDOMParser: new dom.window.DOMParser() });
Highlighted Details
Maintenance & Community
CONTRIBUTING.md
Licensing & Compatibility
Limitations & Caveats
The overrideDOMParser option exists for Node.js environments, which have no built-in DOMParser, so server-side use requires supplying one from a DOM implementation. Hooks for custom element processing and for unhandled elements make the converter extensible, but non-standard HTML may still require custom logic.