Docs2KG by AI4WA

CLI tool for knowledge graph construction from documents

Created 1 year ago
327 stars

Top 83.4% on SourcePulse

Project Summary

Docs2KG offers a unified approach to constructing knowledge graphs from diverse document types, targeting researchers and developers who need to extract structured information from unstructured text. It leverages a human-LLM collaborative framework to improve the quality and efficiency of knowledge graph generation.

How It Works

Docs2KG employs a hybrid bottom-up and top-down strategy, integrating Large Language Models (LLMs) for knowledge graph and ontology construction. It categorizes knowledge into MetaKG (document metadata), LayoutKG (document structure), and SemanticKG (content entities and relations). A key feature is its human-LLM collaborative interface, enabling iterative refinement of the knowledge graph based on human feedback, which in turn enhances the LLM's performance.
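One way to picture the three layers described above is as a single graph whose nodes each carry a layer tag. The following is a minimal sketch under that assumption. The class names, field names, and relation labels (Node, Edge, KnowledgeGraph, HAS_SECTION, MENTIONS) are hypothetical and are not taken from the Docs2KG codebase; only the layer names come from the summary.

```python
from dataclasses import dataclass, field

# Layer names follow the summary above; everything else here is illustrative.
LAYERS = ("MetaKG", "LayoutKG", "SemanticKG")


@dataclass(frozen=True)
class Node:
    node_id: str
    layer: str  # one of LAYERS
    label: str  # hypothetical labels such as "Document", "Section", "Entity"


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        # Reject nodes that do not belong to one of the three layers.
        if node.layer not in LAYERS:
            raise ValueError(f"unknown layer: {node.layer}")
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Both endpoints must already be present in the graph.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("edge endpoints must be added as nodes first")
        self.edges.append(edge)

    def layer_nodes(self, layer):
        # Return every node tagged with the given layer.
        return [n for n in self.nodes.values() if n.layer == layer]


def build_example():
    # A document (metadata) contains a section (layout) that mentions an entity (semantics).
    kg = KnowledgeGraph()
    kg.add_node(Node("doc1", "MetaKG", "Document"))
    kg.add_node(Node("sec1", "LayoutKG", "Section"))
    kg.add_node(Node("ent1", "SemanticKG", "Entity"))
    kg.add_edge(Edge("doc1", "sec1", "HAS_SECTION"))
    kg.add_edge(Edge("sec1", "ent1", "MENTIONS"))
    return kg
```

Keeping the layer as a node attribute, rather than using three separate graphs, lets cross-layer edges such as section-to-entity links live in the same structure.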

Quick Start & Requirements

  • Installation: run pip install Docs2KG, then python -m spacy download en_core_web_sm.
  • Prerequisites: Python environment, spaCy English model. LLM access (e.g., via Ollama) is required for agent-based processing.
  • Usage: Set the CONFIG_FILE environment variable before running the CLI. Commands include docs2kg process-document, docs2kg batch-process, docs2kg list-formats, and docs2kg neo4j.
  • Documentation: Detailed setup and tutorials are available in the official documentation.
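The usage step above can be scripted. The sketch below sets CONFIG_FILE for a child process and invokes a docs2kg subcommand through Python's standard subprocess module. The config path "config.yml" and the helper names are hypothetical, and the summary does not say which arguments each subcommand accepts, so none are passed here; consult the official documentation before adding any.

```python
import os
import shutil
import subprocess


def build_env(config_path):
    # Copy the current environment and add CONFIG_FILE for the child process only.
    env = dict(os.environ)
    env["CONFIG_FILE"] = config_path
    return env


def run_docs2kg(subcommand, config_path, extra_args=()):
    # Locate the CLI on PATH; return None when Docs2KG is not installed.
    exe = shutil.which("docs2kg")
    if exe is None:
        return None
    # extra_args is left empty by default because the accepted arguments
    # are not described in the summary.
    return subprocess.run(
        [exe, subcommand, *extra_args],
        env=build_env(config_path),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # "config.yml" is a placeholder path, not a file shipped with the project.
    result = run_docs2kg("list-formats", "config.yml")
    if result is None:
        print("docs2kg not found on PATH; install it with pip first")
    else:
        print(result.stdout)
```

Passing the variable through env keeps the setting scoped to the one invocation instead of changing the parent shell.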

Highlighted Details

  • Supports heterogeneous document formats: PDF, DOCX, HTML, EPUB.
  • Provides a human-LLM collaborative interface for iterative knowledge graph refinement.
  • Outputs knowledge graphs suitable for downstream applications like RAG.
  • Includes metrics for evaluating automatic construction quality.
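Before handing a folder to a batch run, it can help to separate files in the listed formats from everything else. This is a small sketch, assuming the four formats correspond to the extensions .pdf, .docx, .html, and .epub; that mapping and the function name are assumptions, and docs2kg list-formats remains the authoritative source for what the tool accepts.

```python
from pathlib import Path

# Assumed extension mapping for the formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".html", ".epub"}


def split_by_support(paths):
    # Partition paths into (supported, skipped) by case-insensitive extension.
    supported, skipped = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```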

Maintenance & Community

The project is associated with AI4WA. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. It provides an arXiv citation, suggesting it is research-oriented. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research contribution (arXiv:2406.02962), implying it may be in an early stage of development. Specific limitations regarding supported LLMs, scalability, or robustness in production environments are not detailed.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
