Open-source package for structured data extraction from unstructured documents
Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents, targeting both business users and developers. It enables natural language querying for data extraction, creating structured knowledge representations like tables and graphs, with features for data traceability and customizable extraction rules.
How It Works
The system processes unstructured documents by chunking them, embedding the chunks, and storing them in a vector database (Milvus or Qdrant supported). Users define extraction goals via "Questions," which are natural language prompts. Customizable "Rules" (May Return, Must Return, Allowed # of Responses, Resolve Entity) guide the LLM's extraction process, ensuring data quality and consistency. Extracted data can be formatted, filtered, and exported as CSV or graph triples, with support for chained extractions and splitting cells into rows.
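The rule and export steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Knowledge Table's actual API: the Rules class, build_prompt, apply_rules, split_cell_into_rows, to_csv, and to_triples are all hypothetical names, and the semantics assumed here (May Return as a non-binding hint in the prompt, Must Return as a strict whitelist, Resolve Entity as an alias-to-canonical mapping) are one plausible reading of the rule names.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rules:
    """Hypothetical container for the four rule types (not the package's own class)."""
    may_return: list = field(default_factory=list)      # hint values, not enforced (assumption)
    must_return: list = field(default_factory=list)     # strict whitelist (assumption)
    max_responses: Optional[int] = None                 # "Allowed # of Responses"
    resolve_entity: dict = field(default_factory=dict)  # alias -> canonical name


def build_prompt(question: str, rules: Rules) -> str:
    """Fold the rules into the natural language question sent to the LLM."""
    parts = [question]
    if rules.may_return:
        parts.append("Possible values: " + ", ".join(rules.may_return))
    if rules.must_return:
        parts.append("Answer only with: " + ", ".join(rules.must_return))
    if rules.max_responses is not None:
        parts.append(f"Return at most {rules.max_responses} value(s).")
    return "\n".join(parts)


def apply_rules(raw_answers: list, rules: Rules) -> list:
    """Post-process raw LLM answers so they satisfy the rules."""
    # Resolve Entity: map known aliases to their canonical form.
    answers = [rules.resolve_entity.get(a, a) for a in raw_answers]
    # Must Return: drop anything outside the whitelist, if one is set.
    if rules.must_return:
        answers = [a for a in answers if a in rules.must_return]
    # Remove duplicates while preserving order.
    answers = list(dict.fromkeys(answers))
    # Allowed # of Responses: truncate to the limit, if one is set.
    if rules.max_responses is not None:
        answers = answers[: rules.max_responses]
    return answers


def split_cell_into_rows(document: str, answers: list) -> list:
    """Turn one multi-valued cell into one row per value."""
    return [{"document": document, "answer": a} for a in answers]


def to_csv(rows: list) -> str:
    """Serialize rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["document", "answer"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_triples(rows: list, predicate: str) -> list:
    """Express each row as a (subject, predicate, object) graph triple."""
    return [(r["document"], predicate, r["answer"]) for r in rows]


rules = Rules(
    must_return=["Acme Corp", "Globex"],
    max_responses=2,
    resolve_entity={"ACME": "Acme Corp", "Acme Inc.": "Acme Corp"},
)
answers = apply_rules(["ACME", "Acme Inc.", "Initech", "Globex", "Globex"], rules)
rows = split_cell_into_rows("contract.pdf", answers)
```

In this sketch the rules are applied twice: once as instructions in the prompt and once as a deterministic check on the answers, so an answer that ignores the prompt is still filtered before export. The file name contract.pdf and the company names are made-up sample data.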
Quick Start & Requirements
With Docker: docker-compose up -d --build (access the frontend at http://localhost:3000 and the backend at http://localhost:8000).
Without Docker: install the package (pip install . or pip install .[dev]), then start the backend (uvicorn app.main:app) and the frontend (bun start).
Settings are supplied through a .env file. Optional Unstructured API integration requires an API key and pip install .[unstructured].
Highlighted Details
Maintenance & Community
The project is developed by WhyHow.AI. Support and community engagement are available via email (team@whyhow.ai) and Discord. A roadmap is provided, with several features already implemented and others planned.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Currently relies on OpenAI as the default LLM provider, with plans to support Azure OpenAI, Llama3, GPT-4, and Anthropic. Vector database support is limited to Milvus and Qdrant, with plans to add Weaviate, Chroma, and Pinecone. Backend data store options are limited, with PostgreSQL, MongoDB, MySQL, and Redis planned. Deployment scripts for cloud environments are not yet available.