sift-kg by juanceresa

Build interactive knowledge graphs from any documents

Created 4 months ago

495 stars

Top 62.1% on SourcePulse

Project Summary

This project turns unstructured documents into structured, explorable knowledge graphs. It targets researchers, analysts, and power users who need to understand the relationships within large document collections without extensive coding or infrastructure setup. Its main benefit is fast knowledge graph generation directly from the command line, paired with an interactive viewer for exploring how documents connect.

How It Works

sift-kg runs a multi-stage pipeline. Documents are first processed for text extraction, with support for over 75 formats and optional OCR for scanned content. An LLM then either performs schema discovery or uses a predefined domain to identify the entity and relation types relevant to the corpus. Those types guide the extraction step, which produces a NetworkX-based knowledge graph. A key differentiator is human-in-the-loop entity resolution: the LLM proposes merges, but users must approve them through an interactive terminal UI or by editing YAML files, which keeps accuracy and control with the user. The pipeline ends with an interactive browser-based viewer and export options.
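
The approval gate in the entity resolution step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sift-kg's actual API: Relation, MergeProposal, and apply_approved_merges are hypothetical names, the sample data is invented, and a plain list of relations stands in for the NetworkX graph so the sketch has no dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One extracted relation, with the document it came from (hypothetical shape)."""
    source: str
    target: str
    kind: str
    document: str


@dataclass
class MergeProposal:
    """A suggested merge; it has no effect until a human sets approved=True."""
    keep: str
    drop: str
    approved: bool = False


def apply_approved_merges(relations, proposals):
    """Rewrite entity ids for approved merges only; other proposals are ignored."""
    alias = {p.drop: p.keep for p in proposals if p.approved}

    def canonical(entity_id):
        # Follow alias chains (a -> b -> c); `seen` guards against cycles.
        seen = set()
        while entity_id in alias and entity_id not in seen:
            seen.add(entity_id)
            entity_id = alias[entity_id]
        return entity_id

    return [
        Relation(canonical(r.source), canonical(r.target), r.kind, r.document)
        for r in relations
    ]


# Invented sample data for illustration.
relations = [
    Relation("jane_doe", "acme_corp", "WORKS_FOR", "report.pdf"),
    Relation("jane_doe", "acme_corporation", "FOUNDED", "memo.docx"),
    Relation("j_doe", "acme_corp", "ADVISES", "notes.txt"),
]
proposals = [
    MergeProposal(keep="acme_corp", drop="acme_corporation", approved=True),
    MergeProposal(keep="jane_doe", drop="j_doe"),  # proposed, not yet reviewed
]

resolved = apply_approved_merges(relations, proposals)
for relation in resolved:
    print(relation)
```

Only the approved proposal changes the data: acme_corporation is folded into acme_corp, while j_doe stays separate until someone reviews it. Each relation keeps its document field, so provenance survives the merge.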

Quick Start & Requirements

Primary install: pip install sift-kg
Prerequisites: Python 3.11+. Requires access to an LLM provider: an API key for hosted services (OpenAI, Anthropic, Mistral, etc.) or a local model server such as Ollama. Optional dependencies for OCR (Tesseract, EasyOCR, PaddleOCR, Google Cloud Vision) and semantic clustering (pip install sift-kg[embeddings]).
Setup: Initialize with sift init, configure API keys in .env, then run sift extract, sift build, and sift view.
Links: Live demos are available but not directly linked in the README.

Highlighted Details

Zero-config start: Simply point the CLI at a folder of documents to generate a knowledge graph.
Flexible LLM integration: Supports OpenAI, Anthropic, Mistral, Ollama, and any LiteLLM-compatible provider.
Schema-free by default: An LLM call samples documents to design a tailored schema, saved for reuse.
Interactive viewer: Features community regions, focus mode, keyboard navigation, and a trail breadcrumb for deep exploration.
Comprehensive export: Outputs to GraphML, GEXF, SQLite, CSV, and native JSON formats.
Source provenance: Every extracted entity and relation links back to its source document and passage.
Broad format support: Handles PDFs, DOCX, XLSX, HTML, images, and more (over 75 formats in total) via the Kreuzberg extraction engine.
Controlled deduplication: Never merges entities without explicit user approval.
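
To make the export and provenance points above concrete, here is a hedged sketch of what CSV and SQLite output with source links could look like. The table name, column names, and sample rows are assumptions made for illustration; sift-kg's real export schema is not described here and may differ.

```python
import csv
import io
import sqlite3

# Assumed column layout: each relation carries its source document and passage.
COLUMNS = ["source", "relation", "target", "document", "passage"]

# Invented sample rows for illustration.
rows = [
    ("Jane Doe", "WORKS_FOR", "Acme Corp", "report.pdf",
     "Jane Doe joined Acme Corp in 2019."),
    ("Acme Corp", "LOCATED_IN", "Berlin", "memo.docx",
     "Acme Corp opened its Berlin office."),
]


def to_csv(rows):
    """Serialize relations to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows, path=":memory:"):
    """Load relations into a SQLite table and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE relations "
        "(source TEXT, relation TEXT, target TEXT, document TEXT, passage TEXT)"
    )
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


csv_text = to_csv(rows)
conn = to_sqlite(rows)

# Provenance lookup: which documents mention a given entity?
entity = "Acme Corp"
documents = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT document FROM relations "
        "WHERE source = ? OR target = ? ORDER BY document",
        (entity, entity),
    )
]
print(documents)
```

Keeping the document and passage on every row means any relation in an export can be traced back to the text it was extracted from.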

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

License: MIT License.
Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The tool relies on external LLM APIs, which can incur costs and require API key management. Local OCR options exist, but setting up the more advanced OCR backends or the embedding models adds dependencies. The promise of working without extensive coding applies to the primary CLI workflow; programmatic use requires Python scripting. For high-accuracy use cases, the human-in-the-loop review of entity merges is essential and takes user time.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive

Star History

8 stars in the last 30 days