knowledge-table by whyhow-ai

Open-source package for structured data extraction from unstructured documents

Created 1 year ago

658 stars

Top 51.1% on SourcePulse

Project Summary

Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents, targeting both business users and developers. Users extract data with natural language queries and build structured knowledge representations such as tables and graphs, with support for data traceability and customizable extraction rules.

How It Works

The system processes unstructured documents by chunking them, embedding the chunks, and storing them in a vector database (Milvus or Qdrant supported). Users define extraction goals via "Questions," which are natural language prompts. Customizable "Rules" (May Return, Must Return, Allowed # of Responses, Resolve Entity) guide the LLM's extraction process, ensuring data quality and consistency. Extracted data can be formatted, filtered, and exported as CSV or graph triples, with support for chained extractions and splitting cells into rows.
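The chunk, embed, store, and query flow described above can be sketched in a few lines. This is a minimal illustration only, not the project's implementation: the names chunk_text, embed, and InMemoryVectorStore are hypothetical, the word-window chunking policy is an assumption, and a toy bag-of-words vector stands in for a real embedding model and for Milvus or Qdrant.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping word windows (assumed chunking policy)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as Milvus or Qdrant."""

    def __init__(self) -> None:
        self.rows: list[tuple[int, str, Counter]] = []  # (chunk_id, text, vector)

    def add(self, chunks: list[str]) -> None:
        for text in chunks:
            self.rows.append((len(self.rows), text, embed(text)))

    def search(self, question: str, k: int = 2) -> list[tuple[int, str]]:
        """Return the k chunks most similar to the question, best first."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(chunk_id, text) for chunk_id, text, _ in ranked[:k]]
```

In a sketch like this, the chunk IDs returned by search are what would let an extracted answer be traced back to its source passage; the retrieved chunks would then be passed to the LLM together with the Question and its Rules.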

Quick Start & Requirements

Docker: docker-compose up -d --build (Access frontend at http://localhost:3000, backend at http://localhost:8000)
Native: Requires Python 3.10+ and Bun (for the frontend). Clone the repo, set up a Python virtual environment, install dependencies (pip install . or pip install .[dev]), then start the backend (uvicorn app.main:app) and the frontend (bun start).
Environment: Requires an OpenAI API key and configuration for the vector store (Milvus/Qdrant) in a .env file. Optional Unstructured API integration requires an API key and pip install .[unstructured].

Highlighted Details

Natural language querying for data extraction.
Customizable extraction rules (May Return, Must Return, Allowed # of Responses, Resolve Entity).
Data traceability via chunk linking and provenance in UI.
Export to CSV or graph triples, with LLM-generated schema for triples.
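To make the rule and export features above more concrete, here is a rough sketch of how extracted answers might be post-processed and exported. Everything in it is an assumption for illustration: the Rules class, apply_rules, to_csv, and to_triples are hypothetical names, and the rule semantics are one plausible reading (Must Return as an exhaustive list of allowed values, May Return as example values passed to the LLM as hints rather than enforced). The triple export simply reuses column names as predicates, whereas the project is described as using an LLM-generated schema.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rules:
    """Hypothetical container mirroring the four rule types (assumed semantics)."""
    may_return: list[str] = field(default_factory=list)   # example values, hints only
    must_return: list[str] = field(default_factory=list)  # exhaustive allowed values
    max_responses: Optional[int] = None                   # "Allowed # of Responses"
    resolve_entity: dict[str, str] = field(default_factory=dict)  # alias -> canonical


def apply_rules(answers: list[str], rules: Rules) -> list[str]:
    """Post-process raw LLM answers: resolve aliases, filter, dedupe, truncate."""
    resolved = [rules.resolve_entity.get(a, a) for a in answers]
    if rules.must_return:
        resolved = [a for a in resolved if a in rules.must_return]
    unique = list(dict.fromkeys(resolved))  # drop duplicates, keep first-seen order
    if rules.max_responses is not None:
        unique = unique[:rules.max_responses]
    return unique


def to_csv(rows: list[dict[str, list[str]]], columns: list[str]) -> str:
    """Serialize table rows to CSV; multi-valued cells are joined with '; '."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({c: "; ".join(row.get(c, [])) for c in columns})
    return buf.getvalue()


def to_triples(rows: list[dict[str, list[str]]], subject_col: str) -> list[tuple[str, str, str]]:
    """Emit (subject, predicate, object) triples, using column names as predicates."""
    triples = []
    for row in rows:
        for subject in row.get(subject_col, []):
            for col, values in row.items():
                if col == subject_col:
                    continue
                triples.extend((subject, col, value) for value in values)
    return triples
```

Keeping each cell as a list of values is a design choice made here so that multi-answer cells can later be split into rows or expanded into one triple per value.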

Maintenance & Community

The project is developed by WhyHow.AI. Support and community engagement are available via email (team@whyhow.ai) and Discord. A roadmap is provided, with several features already implemented and others planned.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project currently relies on OpenAI as its default LLM provider, with plans to support Azure OpenAI, Llama3, GPT-4, and Anthropic. Vector database support is limited to Milvus and Qdrant, with plans to add Weaviate, Chroma, and Pinecone. Backend data store options are also limited; PostgreSQL, MongoDB, MySQL, and Redis are planned. Deployment scripts for cloud environments are not yet available.

Explore Similar Projects

docai by PragmaticMachineLearning

Document-Parser-Agent by Micheliliuv87

introspect by defog-ai

doctran by finic-ai

documind by DocumindHQ

spacy-layout by explosion

qsv by dathere

kg-gen by stair-lab

docext by NanoNets

open-parse by Filimoa

llmsherpa by nlmatics

ragflow by infiniflow