knowledge-table by whyhow-ai

Open-source package for structured data extraction from unstructured documents

created 10 months ago
596 stars

Top 55.4% on sourcepulse

Project Summary

Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents, targeting both business users and developers. Users query documents in natural language to extract data into structured representations such as tables and graphs, with support for data traceability and customizable extraction rules.

How It Works

The system processes unstructured documents by chunking them, embedding the chunks, and storing them in a vector database (Milvus or Qdrant supported). Users define extraction goals via "Questions," which are natural language prompts. Customizable "Rules" (May Return, Must Return, Allowed # of Responses, Resolve Entity) guide the LLM's extraction process, ensuring data quality and consistency. Extracted data can be formatted, filtered, and exported as CSV or graph triples, with support for chained extractions and splitting cells into rows.
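The flow above can be sketched in a few lines. This is a toy illustration only: the real system uses an LLM embedding model and a Milvus/Qdrant vector store, while the fixed-size chunker and bag-of-words relevance score below are stand-ins, and the function names are not Knowledge Table's API.

```python
# Toy sketch of the ingestion-and-retrieval flow: chunk a document,
# then retrieve the chunks most relevant to a natural-language "Question".

def chunk(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows, as a chunking step might."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def score(query: str, passage: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the Question."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]

doc = "Acme Corp was founded in 1999. Its CEO is Jane Doe. Acme sells widgets."
top = retrieve("Who is the CEO of Acme?", chunk(doc, size=40, overlap=10), k=1)
```

In the real pipeline, the retrieved chunks are passed to the LLM together with the Question and its Rules, and the answer is written into the table cell with a link back to the source chunks.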

Quick Start & Requirements

  • Docker: docker-compose up -d --build (Access frontend at http://localhost:3000, backend at http://localhost:8000)
  • Native: Python 3.10+, Bun (frontend). Requires cloning the repo, setting up a Python virtual environment, installing dependencies (pip install . or pip install .[dev]), and starting backend (uvicorn app.main:app) and frontend (bun start).
  • Environment: Requires an OpenAI API key and configuration for the vector store (Milvus/Qdrant) in a .env file. Optional Unstructured API integration requires an API key and pip install .[unstructured].
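A minimal `.env` for the Milvus setup might look like the fragment below. The variable names other than `OPENAI_API_KEY` are illustrative assumptions, not verified against the repo; check the project's `.env.example` for the authoritative names.

```
# LLM provider (required)
OPENAI_API_KEY=sk-...

# Vector store configuration (names assumed; see the repo's .env.example)
VECTOR_DB_PROVIDER=milvus
MILVUS_DB_URI=http://localhost:19530
```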

Highlighted Details

  • Natural language querying for data extraction.
  • Customizable extraction rules (May Return, Must Return, Allowed # of Responses, Resolve Entity).
  • Data traceability via chunk linking and provenance in UI.
  • Export to CSV or graph triples, with LLM-generated schema for triples.
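The CSV export is a flat table, while the graph export turns each filled cell into a subject-predicate-object triple. The sketch below only shows that triple shape; `row_to_triples` is a hypothetical helper, not Knowledge Table's actual API, and in the real export the predicate schema is generated by an LLM rather than taken directly from column names.

```python
# Turn one extracted table row into graph triples: each column becomes a
# predicate linking the row's entity (subject) to the cell value (object).
# Hypothetical helper for illustration; the real schema is LLM-generated.

def row_to_triples(entity: str, row: dict[str, str]) -> list[tuple[str, str, str]]:
    """Emit (subject, predicate, object) triples, skipping empty cells."""
    return [(entity, column, value) for column, value in row.items() if value]

row = {"founded": "1999", "ceo": "Jane Doe", "sector": ""}
triples = row_to_triples("Acme Corp", row)
```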

Maintenance & Community

The project is developed by WhyHow.AI. Support and community engagement are available via email (team@whyhow.ai) and Discord. A roadmap is provided, with several features already implemented and others planned.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Currently relies on OpenAI as the default LLM provider, with plans to support Azure OpenAI, Llama3, GPT-4, and Anthropic. Vector database support is limited to Milvus and Qdrant, with plans to add Weaviate, Chroma, and Pinecone. Backend data store options are limited, with PostgreSQL, MongoDB, MySQL, and Redis planned. Deployment scripts for cloud environments are not yet available.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 32 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

  • RAG framework for fast, simple retrieval-augmented generation
  • Top 1.0%, 19k stars
  • Created 10 months ago, updated 1 day ago