KG-Pipeline  by FareedKhan-dev

LLM-powered pipeline for text-to-knowledge graph conversion

Created 6 months ago
252 stars

Top 99.5% on SourcePulse

Project Summary

This project provides an end-to-end Python pipeline for converting unstructured text into interactive knowledge graphs. It targets beginner to intermediate Python users interested in NLP, knowledge graphs, and LLMs. It demonstrates the data transformation process step by step and lets users visualize the relationships extracted from the text.

How It Works

The pipeline leverages Large Language Models (LLMs) to extract Subject-Predicate-Object (SPO) triples from input text. It breaks down longer documents into smaller chunks to manage LLM context limits. The core approach uses the openai library for LLM interaction, networkx to build the graph data structure, and ipycytoscape for interactive, in-notebook visualization of the resulting knowledge graph. This granular, step-by-step methodology emphasizes transparency and educational value, allowing users to observe data evolution at each stage.
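The chunking step described above can be illustrated with a minimal sketch. It is not taken from the project: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words.

    Hypothetical helper: the sizes are illustrative defaults, not values
    confirmed by the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap is one way to reduce the chance that a relationship spanning a chunk boundary is lost; each chunk would then be sent to the LLM separately for triple extraction.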

Quick Start & Requirements

  • Primary Install: pip install openai networkx "ipycytoscape>=1.3.1" ipywidgets pandas. A kernel/runtime restart is typically required post-installation.
  • Prerequisites:
    • Python 3 environment.
    • LLM API access (e.g., OpenAI, Ollama, Nebius AI) configured via environment variables (OPENAI_API_KEY, OPENAI_API_BASE).
    • Jupyter Notebook or JupyterLab for interactive visualization.
    • ipywidgets extension may need explicit enabling in classic Jupyter Notebook.
  • Setup Time: Minimal for library installation; LLM API key configuration is straightforward.
  • Links: No external quick-start guides or demo links are provided beyond library installation commands.
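As a rough illustration of the environment-variable configuration listed above, the following sketch reads `OPENAI_API_KEY` and `OPENAI_API_BASE` and fails early if the key is missing. The helper name `load_llm_settings` and the returned dictionary shape are assumptions, not part of the project.

```python
import os
from typing import Mapping, Optional


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read LLM connection settings from environment variables.

    Hypothetical helper. `env` defaults to os.environ; passing a plain
    dict makes the function easy to test.
    """
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")

    # The base URL is optional; it only matters when pointing at an
    # OpenAI-compatible endpoint other than the default one.
    base_url = env.get("OPENAI_API_BASE", "").strip() or None
    return {"api_key": api_key, "base_url": base_url}
```

With version 1.x of the openai library, these values could be passed on as `OpenAI(api_key=..., base_url=...)`; whether the project constructs its client this way is not stated in the summary.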

Highlighted Details

  • End-to-End Workflow: Demonstrates the complete process from raw text ingestion to a fully visualized knowledge graph.
  • Granular Process Visibility: Each step, including text chunking, LLM extraction, normalization, and graph construction, is explicitly shown with intermediate outputs.
  • Interactive Visualization: Employs ipycytoscape to render dynamic, explorable graphs directly within the notebook environment.
  • Detailed Prompt Engineering: Includes well-defined system and user prompts for LLM interaction, focusing on accurate SPO triple extraction, lowercase output, and pronoun resolution.
  • Data Transformation Tracking: Outputs raw LLM responses, parsed JSON, and normalized triples, providing insight into data evolution.
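The normalization step mentioned in the list above might look something like the sketch below. It assumes that parsed triples arrive as dictionaries with `subject`, `predicate`, and `object` keys; those key names and the function `normalize_triples` are hypothetical and may differ from the project's actual code.

```python
def normalize_triples(raw_triples):
    """Lowercase, trim, and de-duplicate SPO triples.

    Hypothetical helper: expects an iterable of dicts with the assumed
    keys "subject", "predicate", and "object". Malformed items are skipped.
    """
    seen = set()
    result = []
    for item in raw_triples:
        if not isinstance(item, dict):
            continue  # ignore anything that is not a triple-like dict
        parts = []
        for key in ("subject", "predicate", "object"):
            value = item.get(key)
            if not isinstance(value, str):
                break
            # Collapse internal whitespace, trim the ends, and lowercase.
            value = " ".join(value.split()).lower()
            if not value:
                break
            parts.append(value)
        else:
            # Reached only when all three fields were valid (no break).
            triple = tuple(parts)
            if triple not in seen:
                seen.add(triple)
                result.append(triple)
    return result
```

Each resulting `(subject, predicate, object)` tuple could then be added to a `networkx.DiGraph` with `add_edge(subject, obj, label=predicate)`, which is one plausible way to carry out the graph construction step.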

Maintenance & Community

The provided README does not contain information regarding maintainers, notable contributors, community channels (e.g., Discord, Slack), roadmaps, sponsorships, or partnerships.

Licensing & Compatibility

The README does not specify a software license. This absence prevents clear determination of usage rights, modification permissions, and compatibility for commercial or closed-source integration.

Limitations & Caveats

The pipeline is dependent on external LLM services, requiring API keys and potentially incurring usage costs. LLM output quality can vary, necessitating careful prompt tuning or robust error handling for production use. Text chunking is required for longer documents due to LLM context limitations. Interactive visualization is primarily designed for Jupyter environments. The project's scalability for extremely large datasets or complex graph analysis is not detailed, and the lack of a specified license is a significant adoption blocker.
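To make the point about error handling concrete, here is a minimal sketch of defensive parsing for an LLM response that is expected to contain a JSON array of triples. The function name `parse_triples_json` and the fallback strategy (searching for the outermost bracketed span) are assumptions for illustration; the project's own parsing logic is not described in this summary.

```python
import json
import re


def parse_triples_json(raw: str) -> list:
    """Best-effort extraction of a JSON array from raw LLM output.

    Hypothetical helper. Returns an empty list instead of raising when
    no valid JSON array can be found.
    """
    candidates = [raw.strip()]

    # LLMs sometimes wrap JSON in prose or markdown fences, so also try
    # the span from the first "[" to the last "]".
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, list):
            return data
    return []
```

Returning an empty list lets a pipeline skip a bad chunk and continue; a stricter design could log the raw response or retry the request instead.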

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days
